Unable to access Impala parquet file
Hi Team,
We are loading data into HDFS through Impala. While trying to access the data in Vertica through HCatalog or external tables, we get the error below.
SQL Error [7258] [22V24]: [Vertica]VJDBC ERROR: Failed to read parquet source [hdfs://xyz.com:8020/user/hive/warehouse/parquet_table_name/6142a35057675f5e-294275c300000000_479969595_data.0.parq]: INT_32 can only annotate INT32
Regards,
Avi
Answers
The data types recognized by Hive differ from the data types recognized by Vertica. Check the documentation for how the HCatalog Connector converts Hive data types into Vertica-compatible data types here:
https://www.vertica.com/docs/latest/HTML/Content/Authoring/HadoopIntegrationGuide/HCatalogConnector/HCatalogDataTypeConversions.htm
Since the error is complaining about data types, please check whether you have defined the right data types in your CREATE EXTERNAL TABLE definition. One way to find the data types of the data in the parquet files is the INFER_EXTERNAL_TABLE_DDL function provided by Vertica; the link below has an example. Once you get the output, compare it with your current external table definition and see if there are any differences:
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/INFER_EXTERNAL_TABLE_DDL.htm
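For example, a minimal call (the HDFS path here is taken from your error message and the table name is just a placeholder; adjust both to your environment):

SELECT INFER_EXTERNAL_TABLE_DDL(
    'hdfs://xyz.com:8020/user/hive/warehouse/parquet_table_name/*.parq',
    'parquet_table_name');

The result is a suggested definition along these lines (illustrative only; the actual column names and types come from the file itself):

create external table "parquet_table_name"("file_id" int) as copy from 'hdfs://xyz.com:8020/user/hive/warehouse/parquet_table_name/*.parq' parquet;

Compare the inferred column types against your current external table definition.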
Hi,
I checked with a dummy table in Hive with data types bigint and string, and inserted data from both Hive and Impala.
I found that the data inserted from Hive is accessible in Vertica, but the data inserted from Impala is not.
DDL:
CREATE EXTERNAL TABLE impalatest (
  file_id bigint)
PARTITIONED BY (
  insertdb string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://xyz/user/hive/warehouse/tb.db/impalatest'
TBLPROPERTIES (
  'PARQUET.COMPRESS'='SNAPPY',
  'transient_lastDdlTime'='1598622306')
Data inserted from Hive:
hive> insert into impalatest partition(insertdb='hive') values(123456789);
Data inserted from Impala:
insert into impalatest partition(insertdb='impala') values(1234567890);
hive> select * from impalatest;
OK
123456789 hive
1234567890 impala
While selecting the Hive partition in Vertica through HCatalog, it works:
select * from hcat_Default.impalatest where insertdb='hive';
Output: 123456789 hive
While selecting the Impala partition in Vertica through HCatalog, it gives an error:
select * from hcat_Default.impalatest where insertdb='impala';
SQL Error [7258] [22V24]: [Vertica]VJDBC ERROR: Failed to read parquet source [hdfs://xyz/user/hive/warehouse/bdphtb.db/impalatest/insertdb%3Dimpala/c648f21952b982b0-9eb4172d00000000_106750020_data.0.parq]: INT_32 can only annotate INT32
Is there any difference in how Vertica reads the parquet files generated by Hive and by Impala?
Regards,
Avi
Hi,
We took the files from HDFS, and while checking the metadata of both parquet files we found a difference in file_id:
in Hive it shows as UNCOMPRESSED, while in Impala it is SNAPPY. Does this make any difference?
[cloudera@a0410phadoops10 tmp]$ hdfs dfs -ls /user/hive/warehouse/xyz/impalatest
Found 3 items
drwxrwxrwx - impala hive 0 2020-08-28 19:56 /user/hive/warehouse/xyz/impalatest/_impala_insert_staging
drwxrwxrwx - cloudera hive 0 2020-08-28 19:48 /user/hive/warehouse/xyz/impalatest/insertdb=hive
drwxr-xr-x - impala hive 0 2020-08-28 19:56 /user/hive/warehouse/xyz/impalatest/insertdb=impala
Metadata information:
Hive Partition file:
[cloudera@a0410phadoops10 impala_vertica_test]$ parquet-tools meta 000000_1
file: file:/tmp/impala_vertica_test/000000_1
creator: parquet-mr version 1.9.0-cdh6.1.1 (build ${buildNumber})
extra: writer.model.name = 2.1.1-cdh6.1.1
file schema: hive_schema
file_id: OPTIONAL INT64 R:0 D:1
row group 1: RC:1 TS:75 OFFSET:4
file_id: INT64 UNCOMPRESSED DO:0 FPO:4 SZ:75/75/1.00 VC:1 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 123456789, max: 123456789, num_nulls: 0]
=================================================
Impala Partition file:
[cloudera@a0410phadoops10 impala_vertica_test]$ parquet-tools meta c648f21952b982b0-9eb4172d00000000_106750020_data.0.parq
file: file:/tmp/impala_vertica_test/c648f21952b982b0-9eb4172d00000000_106750020_data.0.parq
creator: impala version 3.1.0-cdh6.1.1 (build 97215ce79febfa42364dbff8e4c4d3c5bfc583ba)
file schema: schema
file_id: OPTIONAL INT64 O:INT_64 R:0 D:1
row group 1: RC:1 TS:51 OFFSET:4
file_id: INT64 SNAPPY DO:4 FPO:27 SZ:51/47/0.92 VC:1 ENC:PLAIN_DICTIONARY,RLE ST:[min: 1234567890, max: 1234567890, num_nulls: 0]
[cloudera@a0410phadoops10 impala_vertica_test]$
Regards,
Avi
If the compression is Snappy in Impala, you need access to the Snappy native libraries, libhadoop.so and libsnappy.so. Could you please review the link below?
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/HadoopIntegrationGuide/HCatalogConnector/ConfiguringVerticaForHCatalog.htm
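As a rough sketch (library locations vary by distribution, and the destination directory is an assumption; it must be a directory on the library path of the JVM that Vertica uses, as described in the linked doc):

# Locate the native libraries on a Hadoop node
find / -name 'libhadoop.so*' -o -name 'libsnappy.so*' 2>/dev/null
# Copy them to the same path on every Vertica node (destination is an example only)
scp /usr/lib/hadoop/lib/native/libhadoop.so /usr/lib/hadoop/lib/native/libsnappy.so vertica-node:/usr/lib64/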
Which version of Vertica are you using?
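You can check from any client with:

SELECT VERSION();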
Vertica 9.2.1-1 and higher has a fix that lets INT_32 annotate more primitive types in parquet files. For more information, refer to: https://www.vertica.com/docs/ReleaseNotes/9.2.x/Vertica_9.2.x_Release_Notes.htm
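If upgrading is not an option right away, a possible workaround (an assumption on my part, based on your finding that the Hive-written file reads fine) is to rewrite the Impala-written partition with the Hive Parquet writer, which produces a file without the logical-type annotation your metadata dump shows (O:INT_64):

hive> INSERT OVERWRITE TABLE impalatest PARTITION (insertdb='impala')
    > SELECT file_id FROM impalatest WHERE insertdb='impala';

Then re-run the Vertica query against the impala partition.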