"where" when selecting from external table(HDFS), not working with FLOAT, BIGINT e.t.c.

Hello

  1. I create table in hadoop(cloudera quickstart 5.13) parquet format, with 4 column: TICKET_OID (BIGINT),
    NOMER_ZAYAVKI(STRING),
    LAT(DOUBLE),
    LON(DOUBLE).
    Filling with Data, and select without problem(by impala) :

select * from obj_ticket_pq where LON > 37.555848 and LAT < 55.892728 and LON < 38.203597 and LAT > 55.568758;
return 100 records;

  1. I crate in vertica(8.1.1) external table from HDFS
    CREATE EXTERNAL TABLE test (TICKET_OID BIGINT, NOMER_ZAYAVKI VARCHAR(256), LAT DOUBLE PRECISION, LON DOUBLE PRECISION) AS COPY FROM 'hdfs:///data/OBJ_TICKET/pq/*' ON ANY NODE PARQUET;

select * from test; -work!
select * from test where LON > 37.555848 ;- not work!!! - result 0 column;

also if I try to put in "where" a condition on TICKET_OID(BIGINT) - "greater, less or equally" result of a select of 0 records

if I put conditions(equal, like etc.) on NOMER_ZAYAVKI(STRING) everything works correctly.

How do I make "where" work with numbers?

Comments

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    Can you show the output of your SELECT * FROM test; in Vertica? Maybe with a LIMIT of 5?

    My simple test seems to work fine, although I am reading the Parquet data locally as opposed to from hdfs.

    dbadmin=> SELECT * FROM test;
     TICKET_OID | NOMER_ZAYAVKI |   LAT    |   LON
    ------------+---------------+----------+---------
              1 | TEST          | 41.40338 | 2.17403
    (1 row)
    
    dbadmin=> SELECT * FROM test WHERE lat > 37.555848;
     TICKET_OID | NOMER_ZAYAVKI |   LAT    |   LON
    ------------+---------------+----------+---------
              1 | TEST          | 41.40338 | 2.17403
    (1 row)
    
    dbadmin=> export to parquet (directory='/home/dbadmin/testparq') AS SELECT * FROM test;
     Rows Exported
    ---------------
                 1
    (1 row)
    
    dbadmin=> CREATE EXTERNAL TABLE test_ext (TICKET_OID BIGINT, NOMER_ZAYAVKI VARCHAR(256), LAT DOUBLE PRECISION, LON DOUBLE PRECISION) AS COPY FROM '/home/dbadmin/testparq/*.parquet*' ON ANY NODE PARQUET;
    CREATE TABLE
    
    dbadmin=> SELECT * FROM test_ext;
     TICKET_OID | NOMER_ZAYAVKI |   LAT    |   LON
    ------------+---------------+----------+---------
              1 | TEST          | 41.40338 | 2.17403
    (1 row)
    
    dbadmin=> SELECT * FROM test WHERE lat > 37.555848;
     TICKET_OID | NOMER_ZAYAVKI |   LAT    |   LON
    ------------+---------------+----------+---------
              1 | TEST          | 41.40338 | 2.17403
    (1 row)
    
  • edited January 2019

    dbadmin=> select TICKET_OID, NOMER_ZAYAVKI, LON,LAT from test limit 5;
    TICKET_OID | NOMER_ZAYAVKI | LON | LAT
    ------------+---------------+-----------+-----------
    150529 | TT8397782 | 99.423919 | 56.076368
    111766 | TT8163533 | |
    115562 | TT8184930 | 65.328574 | 55.515793
    115563 | TT8184944 | 37.55664 | 55.43101
    109896 | TT8152791 | 34.290248 | 53.325734
    (5 rows)

    hm...
    I repeated your test and everything work fine. The table was created, export to hdfs in paruet format. And then connected back to vertica as external.
    "select...where" works correctly.

    Probably in my case a wrong parquet format? I'm creating "bad" table on hdfs from impala.

    I will try to compare the table files created in the Impala and in the vertica to understand the difference.
    Thanks for the tip

  • I conducted an experiment on create data to hdfs in parquet format.
    Create table on hdfs from vertica, hive, impala in same data:

    Create external table from hdfs on 3 tables, and select this:
    dbadmin=> select * from test_parquet_from_vertica;
    TICKET_OID | NOMER_ZAYAVKI | LAT | LON
    ------------+---------------+----------+----------
    315131 | TT8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_hive;
    ticket_oid | nomer_zayavki | lat | lon
    ------------+---------------+----------+----------
    315131 | tt8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_impala;
    ticket_oid | nomer_zayavki | lat | lon
    ------------+---------------+----------+----------
    315131 | TT8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_vertica where lat > 10;
    TICKET_OID | NOMER_ZAYAVKI | LAT | LON
    ------------+---------------+----------+----------
    315131 | TT8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_hive where lat > 10;
    ticket_oid | nomer_zayavki | lat | lon
    ------------+---------------+----------+----------
    315131 | tt8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_impala where lat > 10;
    ticket_oid | nomer_zayavki | lat | lon
    ------------+---------------+-----+-----
    (0 rows)

    Info from parquet schema:

    [root@quickstart lib]# sudo -u hdfs parquet-tools schema hdfs://localhost/data/test/pq_vertica
    message schema {
    optional int64 TICKET_OID;
    optional binary NOMER_ZAYAVKI (UTF8);
    optional double LAT;
    optional double LON;
    }

    [root@quickstart lib]# sudo -u hdfs parquet-tools schema hdfs://localhost/data/test/pq_hive
    message hive_schema {
    optional int64 ticket_oid;
    optional binary nomer_zayavki (UTF8);
    optional double lat;
    optional double lon;
    }

    [root@quickstart lib]# sudo -u hdfs parquet-tools schema hdfs://localhost/data/test/pq_impala
    message schema {
    optional int64 ticket_oid;
    optional binary nomer_zayavki;
    optional double lat;
    optional double lon;
    }

    The only difference in nomer_zayavki(UTF8).

    Imapala version 2.10 creates a parquet file not compatible with with Vertica?

  • I created 3 tables on hdfs with the same data, but created them from different programs: vertica, hive, impala
    Connect them as external to vertica:
    dbadmin=> select * from test_parquet_from_vertica;
    TICKET_OID | NOMER_ZAYAVKI | LAT | LON
    ------------+---------------+----------+----------
    315131 | TT8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_hive;
    ticket_oid | nomer_zayavki | lat | lon
    ------------+---------------+----------+----------
    315131 | tt8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_impala;
    ticket_oid | nomer_zayavki | lat | lon
    ------------+---------------+----------+----------
    315131 | TT8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_vertica where lat > 10;
    TICKET_OID | NOMER_ZAYAVKI | LAT | LON
    ------------+---------------+----------+----------
    315131 | TT8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_hive where lat > 10;
    ticket_oid | nomer_zayavki | lat | lon
    ------------+---------------+----------+----------
    315131 | tt8950718 | 55.74702 | 37.54171
    (1 row)

    dbadmin=> select * from test_parquet_from_impala where lat > 10;
    ticket_oid | nomer_zayavki | lat | lon
    ------------+---------------+-----+-----
    (0 rows)

    On HDFS:
    [root@quickstart lib]# sudo -u hdfs parquet-tools schema hdfs://localhost/data/test/pq_vertica
    message schema {
    optional int64 TICKET_OID;
    optional binary NOMER_ZAYAVKI (UTF8);
    optional double LAT;
    optional double LON;
    }

    [root@quickstart lib]# sudo -u hdfs parquet-tools schema hdfs://localhost/data/test/pq_hive
    message hive_schema {
    optional int64 ticket_oid;
    optional binary nomer_zayavki (UTF8);
    optional double lat;
    optional double lon;
    }

    [root@quickstart lib]# sudo -u hdfs parquet-tools schema hdfs://localhost/data/test/pq_impala
    message schema {
    optional int64 ticket_oid;
    optional binary nomer_zayavki;
    optional double lat;
    optional double lon;
    }

    Perhaps imapala(2.10) is not compatible with the vertica(8.1.1)?

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited January 2019

    Hi,

    I wonder if this is a data type incompatibility/casting issue?

    For the Impala table, can you try these SELECT statements?

    select * from test_parquet_from_impala where lat::INT > 10::INT;
    select * from test_parquet_from_impala where lat::NUMERIC> 10::NUMERIC;
    

    Note:
    Impala 2.1 was introduced in CDH 5.3.

    According to the Vertica 8.1.1 doc, Vertica was tested on CDH 5.7 and 5.8.

    See:
    https://www.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/SupportedPlatforms/IntegrationsForHadoop.htm

    And Vertica 9.2 has been tested on CDH 5.11, 5.12 and 5.14.

    See:
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SupportedPlatforms/IntegrationsForHadoop.htm

  • Yes!!!
    Both work.
    Thank you very much for the help!

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited January 2019

    Cool! Note that I used the INT example just as an example. Don't use it for casting a double :) It will ROUND your number.

    dbadmin=> SELECT 55.74702::INT rounded;
     rounded
    ---------
          56
    (1 row)
    
  • Of course I understand
    Thanks again :)

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file