Query Parquet data through Vertica (Vertica Hadoop Integration)

So I have a Hadoop cluster with three nodes, with Vertica co-located on the cluster. There are Parquet files on HDFS. My goal is to query those files using Vertica.

Right now what I have done is use the HDFS Connector: basically, create an external table in Vertica and link it to HDFS:

CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM 'hdfs://hostname/...../data' PARQUET;

Since the data size is big, this method will not achieve good performance.
I have done some research: Vertica Hadoop Integration.

I have tried HCatalog, but there is some configuration error on my Hadoop cluster, so that is not working.

My use case is to keep the data format on HDFS unchanged (Parquet) while querying it using Vertica. Any ideas on how to do that?

Comments

  • swalkaus (Vertica Employee)

    The HCatalog Connector might be useful for synchronizing Hive metadata (e.g. table definitions) into Vertica, but it won't improve query performance.
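
    For reference, a minimal sketch of what mapping a Hive database through the HCatalog Connector looks like (the schema name, host, and user below are placeholders, and the exact connection parameters vary by Vertica version, so check CREATE HCATALOG SCHEMA in the docs for your release):

    CREATE HCATALOG SCHEMA hcat_sketch
        WITH HOSTNAME='metastore-host'
             HCATALOG_SCHEMA='default'
             HCATALOG_USER='hive';
    -- Hive tables in 'default' then appear as external tables under hcat_sketch:
    SELECT * FROM hcat_sketch.some_hive_table LIMIT 10;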

    If you can partition your Parquet data, then partition pruning may improve performance significantly.
    https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/NativeFormats/ReadingNativeFormats.htm?Highlight=parquet
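
    For example, Hive-style partitioning encodes the partition columns in the directory names, which is what lets whole directories be skipped when a query filters on those columns (the paths, column names, and values here are made up):

    hdfs:///path/created=2017-01-01/region=east/part-00000.parquet
    hdfs:///path/created=2017-01-01/region=west/part-00001.parquet
    hdfs:///path/created=2017-01-02/region=east/part-00002.parquet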

    Try profiling your query and use the execution_engine_profiles system table to identify performance bottlenecks.
    https://my.vertica.com/docs/8.0.x/HTML/index.htm#Authoring/AdministratorsGuide/Profiling/HowToProfileASingleStatement.htm

    Start by measuring how fast Vertica is reading from HDFS (you can query execution_engine_profiles while your statement is executing). For a simple query, something like this might do the trick:
    => PROFILE SELECT ...;
    NOTICE 4788: Statement is being profiled
    HINT: Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273708090 and statement_id=1;

    => SELECT b.node_name,b.operator_name,mb/ms AS mb_per_ms
      FROM
        (SELECT node_name,operator_name,counter_name, SUM(counter_value) AS mb
           FROM v_monitor.execution_engine_profiles
          WHERE transaction_id=45035996273708090 AND statement_id=1
            AND counter_name in ('Hadoop FS bytes read through webhdfs', 'Hadoop FS bytes read through native libhdfs++ client')
            AND operator_name = 'Load'
         GROUP BY 1,2,3
        ) AS b
      JOIN
        (SELECT node_name,operator_name,counter_name, SUM(counter_value) AS ms
           FROM v_monitor.execution_engine_profiles
          WHERE transaction_id=45035996273708090 AND statement_id=1
            AND counter_name in ('clock time (us)') AND operator_name ='Load'
         GROUP BY 1,2,3
        ) AS s
      USING (node_name,operator_name); 
    
  • mcellio (Employee)

    This question was also asked on Stack Overflow, where I answered as follows (at the time I thought the asker was using 8.0, not 7.2.3):

    Terminology note: you're not using the HDFS Connector, which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration, and it will not be faster than an external table.)
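
    To make the two access paths concrete, this is roughly how the two URL schemes look in an external table definition (host, port, path, and columns are placeholders):

    -- Direct interface via libhdfs++ (hdfs scheme):
    CREATE EXTERNAL TABLE t_native (id INT, name VARCHAR(50))
    AS COPY FROM 'hdfs://namenode/path/*.parquet' PARQUET;

    -- Same data through the WebHDFS REST service (webhdfs scheme):
    CREATE EXTERNAL TABLE t_web (id INT, name VARCHAR(50))
    AS COPY FROM 'webhdfs://namenode:50070/path/*.parquet' PARQUET;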

    Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.

    You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
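
    If you do decide to regenerate the data, here is a rough sketch of what that could look like on the Hive side (table, column, and partition values are made up; the same idea applies to Spark or whatever writes your Parquet):

    -- Hypothetical Hive DDL: partitioned, Snappy-compressed Parquet
    CREATE TABLE sales_parquet (id INT, name STRING, amount DOUBLE)
    PARTITIONED BY (created DATE, region STRING)
    STORED AS PARQUET
    TBLPROPERTIES ('parquet.compression'='SNAPPY');

    -- Load one partition with rows sorted on the filter column,
    -- so the min/max statistics in the Parquet row groups stay tight:
    INSERT OVERWRITE TABLE sales_parquet PARTITION (created='2017-01-01', region='east')
    SELECT id, name, amount
      FROM sales_staging
     WHERE created='2017-01-01' AND region='east'
     SORT BY id;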

    Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:

    CREATE EXTERNAL TABLE t (id int, name varchar(50), 
                            created date, region varchar(50))
    AS COPY FROM 'hdfs:///path/*/*/*' 
    PARQUET(hive_partition_cols='created,region');
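
    Once the partition columns are declared like that, predicates on them let Vertica skip entire directories instead of scanning everything; a made-up example of a query that benefits:

    SELECT COUNT(*) FROM t
    WHERE region = 'east' AND created >= '2017-01-01';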
    
