Unable to access Impala parquet file
Hi Team,
We are loading data into HDFS through Impala. While trying to access the data in Vertica through HCatalog or external tables, we get the error below.
SQL Error [7258] [22V24]: [Vertica]VJDBC ERROR: Failed to read parquet source [hdfs://xyz.com:8020/user/hive/warehouse/parquet_table_name/6142a35057675f5e-294275c300000000_479969595_data.0.parq]: INT_32 can only annotate INT32
Regards,
Avi
Answers
The data types recognized by Hive differ from the data types recognized by Vertica. Check the documentation for how the HCatalog Connector converts Hive data types into Vertica-compatible data types here:
https://www.vertica.com/docs/latest/HTML/Content/Authoring/HadoopIntegrationGuide/HCatalogConnector/HCatalogDataTypeConversions.htm
Since the error is complaining about data types, please check whether you have defined the right data types in your CREATE EXTERNAL TABLE definition. One way to find the data types of the data in the parquet files is the INFER_EXTERNAL_TABLE_DDL function provided by Vertica; the link below has an example. Once you get the output, compare it with your current external table definition and see if there are any differences:
https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/INFER_EXTERNAL_TABLE_DDL.htm
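For example, a minimal call (the HDFS path here is taken from your error message and the table name is just a placeholder; adjust both to your environment):

SELECT INFER_EXTERNAL_TABLE_DDL(
    'hdfs://xyz.com:8020/user/hive/warehouse/parquet_table_name/*.parq',
    'parquet_table_name');

The result is a suggested definition along these lines (illustrative only; the actual column names and types come from the file itself):

create external table "parquet_table_name"("file_id" int) as copy from 'hdfs://xyz.com:8020/user/hive/warehouse/parquet_table_name/*.parq' parquet;

Compare the inferred column types against your current external table definition.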
Hi,
I checked with a dummy table in Hive with data types bigint and string, and inserted data from both Hive and Impala.
I found that the data inserted from Hive is accessible in Vertica, but the data inserted from Impala is not.
DDL:
CREATE EXTERNAL TABLE impalatest (
  file_id bigint)
PARTITIONED BY (
  insertdb string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'hdfs://xyz/user/hive/warehouse/tb.db/impalatest'
TBLPROPERTIES (
  'PARQUET.COMPRESS'='SNAPPY',
  'transient_lastDdlTime'='1598622306')
Data inserted from Hive:
hive> insert into impalatest partition(insertdb='hive') values(123456789);
Data inserted from Impala:
insert into impalatest partition(insertdb='impala') values(1234567890);
hive> select * from impalatest;
OK
123456789 hive
1234567890 impala
While selecting the Hive partition in Vertica through HCatalog, it works:
select * from hcat_Default.impalatest where insertdb='hive';
Output: 123456789 hive
While selecting the Impala partition in Vertica through HCatalog, it gives an error:
select * from hcat_Default.impalatest where insertdb='impala';
SQL Error [7258] [22V24]: [Vertica]VJDBC ERROR: Failed to read parquet source [hdfs://xyz/user/hive/warehouse/bdphtb.db/impalatest/insertdb%3Dimpala/c648f21952b982b0-9eb4172d00000000_106750020_data.0.parq]: INT_32 can only annotate INT32
Is there any difference in how Vertica reads the parquet files generated by Hive and by Impala?
Regards,
Avi
Hi,
We took the files from HDFS, and while checking the metadata of both parquet files we found a difference in file_id:
in Hive it shows as UNCOMPRESSED, while in Impala it is SNAPPY. Does this make any difference?
[cloudera@a0410phadoops10 tmp]$ hdfs dfs -ls /user/hive/warehouse/xyz/impalatest
Found 3 items
drwxrwxrwx - impala hive 0 2020-08-28 19:56 /user/hive/warehouse/xyz/impalatest/_impala_insert_staging
drwxrwxrwx - cloudera hive 0 2020-08-28 19:48 /user/hive/warehouse/xyz/impalatest/insertdb=hive
drwxr-xr-x - impala hive 0 2020-08-28 19:56 /user/hive/warehouse/xyz/impalatest/insertdb=impala
Metadata information:
Hive Partition file:
[cloudera@a0410phadoops10 impala_vertica_test]$ parquet-tools meta 000000_1
file: file:/tmp/impala_vertica_test/000000_1
creator: parquet-mr version 1.9.0-cdh6.1.1 (build ${buildNumber})
extra: writer.model.name = 2.1.1-cdh6.1.1
file schema: hive_schema
file_id: OPTIONAL INT64 R:0 D:1
row group 1: RC:1 TS:75 OFFSET:4
file_id: INT64 UNCOMPRESSED DO:0 FPO:4 SZ:75/75/1.00 VC:1 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 123456789, max: 123456789, num_nulls: 0]
=================================================
Impala Partition file:
[cloudera@a0410phadoops10 impala_vertica_test]$ parquet-tools meta c648f21952b982b0-9eb4172d00000000_106750020_data.0.parq
file: file:/tmp/impala_vertica_test/c648f21952b982b0-9eb4172d00000000_106750020_data.0.parq
creator: impala version 3.1.0-cdh6.1.1 (build 97215ce79febfa42364dbff8e4c4d3c5bfc583ba)
file schema: schema
file_id: OPTIONAL INT64 O:INT_64 R:0 D:1
row group 1: RC:1 TS:51 OFFSET:4
file_id: INT64 SNAPPY DO:4 FPO:27 SZ:51/47/0.92 VC:1 ENC:PLAIN_DICTIONARY,RLE ST:[min: 1234567890, max: 1234567890, num_nulls: 0]
[cloudera@a0410phadoops10 impala_vertica_test]$
Regards,
Avi
If the compression is Snappy in Impala, you need access to the Snappy native libraries, libhadoop.so and libsnappy.so. Could you please review the link below?
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/HadoopIntegrationGuide/HCatalogConnector/ConfiguringVerticaForHCatalog.htm
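As a rough sketch (library locations vary by distribution, and the destination directory is an assumption; it must be a directory on the library path of the JVM that Vertica uses, as described in the linked doc):

# Locate the native libraries on a Hadoop node
find / -name 'libhadoop.so*' -o -name 'libsnappy.so*' 2>/dev/null
# Copy them to the same path on every Vertica node (destination is an example only)
scp /usr/lib/hadoop/lib/native/libhadoop.so /usr/lib/hadoop/lib/native/libsnappy.so vertica-node:/usr/lib64/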
Which version of Vertica are you using?
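You can check from any client with:

SELECT VERSION();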
Vertica 9.2.1-1 and higher has a fix that lets INT_32 annotate more primitive types in parquet files. For more information, refer to: https://www.vertica.com/docs/ReleaseNotes/9.2.x/Vertica_9.2.x_Release_Notes.htm
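If upgrading is not an option right away, a possible workaround (an assumption on my part, based on your finding that the Hive-written file reads fine) is to rewrite the Impala-written partition with the Hive Parquet writer, which produces a file without the logical-type annotation your metadata dump shows (O:INT_64):

hive> INSERT OVERWRITE TABLE impalatest PARTITION (insertdb='impala')
    > SELECT file_id FROM impalatest WHERE insertdb='impala';

Then re-run the Vertica query against the impala partition.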