How to copy data from a Hive table (with partitions) to a Vertica table?
I have data in a Hive table that has a single partition column (a date string). I am using the COPY command in vsql, with a webhdfs URL pointing at the Hive table, to move the data to a table in Vertica. I am faced with two scenarios:
1. The Vertica table has no partition column (so one column fewer than the Hive table): when I issue the COPY command, the data gets copied without issues.
2. The Vertica table is also partitioned by the date string column (Hive and Vertica now have the same number of columns): when I try to COPY in this case, all the records get rejected and the copy fails with this error: ERROR 2006: COPY: 10000 records have been rejected
The exact COPY command that I am issuing looks like this:
COPY vertica_table SOURCE Hdfs(url='http://url:50070/webhdfs/v1/user/user/warehouse/hive_db.db/hive_table/date=20170525/*') DELIMITER E'\001' NULL AS '' NO ESCAPE REJECTMAX 10000 DIRECT;
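To see why the rows are rejected, I understand the same COPY can be extended with Vertica's REJECTED DATA and EXCEPTIONS clauses (the file paths below are just placeholders):
COPY vertica_table SOURCE Hdfs(url='http://url:50070/webhdfs/v1/user/user/warehouse/hive_db.db/hive_table/date=20170525/*')
DELIMITER E'\001' NULL AS '' NO ESCAPE
REJECTED DATA '/tmp/copy_rejected.txt'    -- placeholder path; rejected rows are written here
EXCEPTIONS '/tmp/copy_exceptions.txt'     -- placeholder path; per-row rejection reasons
REJECTMAX 10000 DIRECT;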
Can someone please help with how to get the data copied along with the partition column?
Thanks!
Answers
Hi anand_soni,
There are a couple of options you can consider to deal with Hive partitions:
1. If you use ORC/PARQUET format to store data for your Hive tables, then the best option for you is to use our built-in ORC/PARQUET readers with partition column support. Syntax is like:
COPY vertica_table FROM 'hdfs:///user/user_warehouse/hive_db.db/hive_table/*/*' ORC(hive_partition_cols='date');
Note that your Vertica table DDL must also contain column "date" in the above case.
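For illustration, a matching Vertica DDL might look like the sketch below; col1 and col2 are placeholders for your real data columns:
-- hypothetical target table; col1/col2 stand in for your actual columns
CREATE TABLE vertica_table (
    col1 VARCHAR(100),
    col2 INT,
    "date" VARCHAR(8)     -- must be present so hive_partition_cols='date' has a target
)
PARTITION BY "date";      -- optional: partition the Vertica table on the same column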
See more details: http://documentation.verticacorp.com/trunk/HTML/index.htm#Authoring/HadoopIntegrationGuide/NativeFormats/ReadingNativeFormats.htm
2. Another option is the HCatalog connector, which works for all file formats:
CREATE HCATALOG SCHEMA hcat_s ...(more options);
select * from hcat_s.hive_table where date='20170525';
Documentation details: http://documentation.verticacorp.com/trunk/HTML/index.htm#Authoring/HadoopIntegrationGuide/HCatalogConnector/UsingTheHCatalogConnector.htm?TocPath=Integrating%20with%20Apache%20Hadoop|Using%20the%20HCatalog%20Connector|_____0
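For example, a sketch of the schema definition; the host, port, and user values below are hypothetical and depend on your cluster (available parameters also vary by Vertica version):
CREATE HCATALOG SCHEMA hcat_s WITH
    HOSTNAME='hcat-host.example.com'    -- hypothetical HCatalog/metastore host
    PORT=9083                           -- hypothetical; a common metastore port
    HCATALOG_SCHEMA='hive_db'           -- the Hive database from your path
    HCATALOG_USER='hcatuser';           -- hypothetical user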
The HCatalog connector automatically creates a temporary external table when you query the Hive table, returns the results, and then destroys the temporary table. So if you only want to query a Hive table, this is the solution, and it works for all formats. But if the table is large, performance may suffer depending on the file format, because internally it uses Java SerDes to load and parse the data.
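Since you want to actually copy the data rather than just query it, you can also materialize the results into a native Vertica table through the connector, e.g. (assuming the target table from your question already exists):
-- load one Hive partition into the native Vertica table via the connector
INSERT /*+ DIRECT */ INTO vertica_table
SELECT * FROM hcat_s.hive_table WHERE date='20170525';
COMMIT;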
So if possible, convert your Hive table to ORC/PARQUET and prefer option #1, which has much better performance; a sketch of that conversion is below.
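On the Hive side, the conversion could look like the following sketch; col1 and col2 are placeholders for your real columns, and dynamic partitioning carries the date partition over:
-- Hive side: build an ORC copy of the table, preserving the date partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE hive_table_orc (col1 STRING, col2 INT)    -- placeholder columns
PARTITIONED BY (`date` STRING)
STORED AS ORC;
INSERT OVERWRITE TABLE hive_table_orc PARTITION (`date`)
SELECT col1, col2, `date` FROM hive_table;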