Unable to access Impala parquet file

Hi Team,
We are loading the data in HDFS through Impala. While trying to access the data in Vertica through hcatalog or external tables getting the below error.

SQL Error [7258] [22V24]: [Vertica]VJDBC ERROR: Failed to read parquet source [hdfs://xyz.com:8020/user/hive/warehouse/parquet_table_name/6142a35057675f5e-294275c300000000_479969595_data.0.parq]: INT_32 can only annotate INT32

Regards,
Avi

Tagged:

Answers

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    The data types recognized by Hive differ from the data types recognized by Vertica. Check the doc for how the HCatalog Connector converts Hive data types into data types compatible with Vertica here:

    https://www.vertica.com/docs/latest/HTML/Content/Authoring/HadoopIntegrationGuide/HCatalogConnector/HCatalogDataTypeConversions.htm

  • SruthiASruthiA Vertica Employee Administrator
    edited August 2020

    Since it is complaining about the data types. please check if you have defined right data types in your create external table definition. One way to find the data types of the data present in parquet files is by using INFER_EXTERNAL_TABLE_DDL function provided by vertica. Please find the below link which has example pertaining to it. Once you get the output, compare it with your current external table definition being used and see if there are any differences

    https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/SQLReferenceManual/Functions/VerticaFunctions/INFER_EXTERNAL_TABLE_DDL.htm

  • Hi,
    I have checked with dummy table in HIVE with datatype bigint and string and inserted data from both HIVE and IMPALA.
    Found the data inserted from hive is accessible in Vertica but data inserted from impala is not accessible Vertica.

    DDL:
    CREATE EXTERNAL TABLE impalatest(
    file_id bigint)
    PARTITIONED BY (
    insertdb string)
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
    'hdfs://xyz/user/hive/warehouse/tb.db/impalatest'
    TBLPROPERTIES (
    'PARQUET.COMPRESS'='SNAPPY',
    'transient_lastDdlTime'='1598622306')
    Time taken: 0.078 seconds, Fetched: 15 row(s)

    Data inserted from Hive:
    hive> insert into impalatest partition(insertdb='hive') values(123456789);
    Data inserted from Impala:
    insert into impalatest partition(insertdb='impala') values(1234567890);

    hive> select * from impalatest;
    OK
    123456789 hive

    1234567890 impala

    while selecting the hive partition in Vertica through hcatalg, it is working
    select * from hcat_Default.impalatest where insertdb='hive';
    Output : 123456789 hive

    while selecting the impala partition in Vertica through hcatalg, it is giving error:
    select * from hcat_Default.impalatest where insertdb='impala';
    SQL Error [7258] [22V24]: [Vertica]VJDBC ERROR: Failed to read parquet source [hdfs://xyz/user/hive/warehouse/bdphtb.db/impalatest/insertdb%3Dimpala/c648f21952b982b0-9eb4172d00000000_106750020_data.0.parq]: INT_32 can only annotate INT32

    Is there any difference the Vertica reads the parquet generated in HIVE and IMPALA.

    Regards,
    Avi

  • Hi,

    We took the file from hdfs and while checking the meta of both the parquet files we got some differences file_id
    in hive it showing as UNCOMPRESSED and in impala its SNAPPY . Does this makes any difference.

    [cloudera@a0410phadoops10 tmp]$ hdfs dfs -ls /user/hive/warehouse/xyx/impalatest
    Found 3 items
    drwxrwxrwx - impala hive 0 2020-08-28 19:56 /user/hive/warehouse/xyz/impalatest/_impala_insert_staging
    drwxrwxrwx - cloudera hive 0 2020-08-28 19:48 /user/hive/warehouse/xyz/impalatest/insertdb=hive
    drwxr-xr-x - impala hive 0 2020-08-28 19:56 /user/hive/warehouse/xyz/impalatest/insertdb=impala

    Meta data information:

    Hive Partition file:

    [cloudera@a0410phadoops10 impala_vertica_test]$ parquet-tools meta 000000_1
    file: file:/tmp/impala_vertica_test/000000_1
    creator: parquet-mr version 1.9.0-cdh6.1.1 (build ${buildNumber})
    extra: writer.model.name = 2.1.1-cdh6.1.1

    file schema: hive_schema

    file_id: OPTIONAL INT64 R:0 D:1

    row group 1: RC:1 TS:75 OFFSET:4

    file_id: INT64 UNCOMPRESSED DO:0 FPO:4 SZ:75/75/1.00 VC:1 ENC:BIT_PACKED,PLAIN,RLE ST:[min: 123456789, max: 123456789, num_nulls: 0]

    =================================================
    Impala Partition file:

    [cloudera@a0410phadoops10 impala_vertica_test]$ parquet-tools meta c648f21952b982b0-9eb4172d00000000_106750020_data.0.parq
    file: file:/tmp/impala_vertica_test/c648f21952b982b0-9eb4172d00000000_106750020_data.0.parq
    creator: impala version 3.1.0-cdh6.1.1 (build 97215ce79febfa42364dbff8e4c4d3c5bfc583ba)

    file schema: schema

    file_id: OPTIONAL INT64 O:INT_64 R:0 D:1

    row group 1: RC:1 TS:51 OFFSET:4

    file_id: INT64 SNAPPY DO:4 FPO:27 SZ:51/47/0.92 VC:1 ENC:PLAIN_DICTIONARY,RLE ST:[min: 1234567890, max: 1234567890, num_nulls: 0]
    [cloudera@a0410phadoops10 impala_vertica_test]$

    Regards,
    Avi

  • SruthiASruthiA Vertica Employee Administrator
    edited August 2020

    If the compression is snappy in IMPALA, you need access to the Snappy native libraries, libhadoop.so and libsnappy.so..Could you please review the below link?

    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/HadoopIntegrationGuide/HCatalogConnector/ConfiguringVerticaForHCatalog.htm

  • poojanpoojan Vertica Employee Employee

    Which version of Vertica are you using?
    Vertica 9.2.1-1 and higher has this fix that lets INT_32 annotate more primitive types in parquet file. For more information refer: https://www.vertica.com/docs/ReleaseNotes/9.2.x/Vertica_9.2.x_Release_Notes.htm

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file