
Why does moving MAP (JSON) data from a Hive table to Vertica distort the MAP data?

I have a table in Hive which has a column of type MAP&lt;String, String&gt;, which is essentially JSON data, e.g.: {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}

As per the official Vertica documentation, MAP data in Hive converts to VARCHAR in Vertica, so I have a corresponding table in Vertica with a column of type VARCHAR. Now I am trying to copy this data from the Hive table to the Vertica table using the Vertica COPY command. All the data gets copied successfully, but the JSON data from Hive arrives distorted, like this:

y1Falsey2Falsey352402292y4d90fce80-b352-4aca-8d83-6e997e48d917

As seen here, the JSON structure is lost completely; I get a plain string in the VARCHAR column instead.

The command that I issue is -

COPY vertica_table SOURCE Hdfs(url='http:url:50070/webhdfs/v1/user/warehouse/hive_db.db/hive_table_name/part=20170530/*') DELIMITER E'\001' NULL AS '' NO ESCAPE REJECTMAX 10000 DIRECT;

Can someone help me understand what is going wrong during the copy? Did I miss something here? Thanks for the help.
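A note on what the "distorted" string probably is: Hive's default text SerDe (LazySimpleSerDe, used for TEXTFILE tables) stores MAP columns with nonprinting control characters as delimiters, which match the E'\001' field delimiter in the COPY command above. The Python sketch below reconstructs what the raw field likely looks like on disk, using the example values from the question; this is an illustration of the delimiter scheme, not something verified against the actual file.

```python
# Hive's LazySimpleSerDe defaults for TEXTFILE tables:
#   \x01 (^A) separates top-level columns,
#   \x02 (^B) separates the entries of a MAP or ARRAY,
#   \x03 (^C) separates a MAP key from its value.
# Reconstruct what the example MAP likely looks like in the raw HDFS file.
entries = {
    "y1": "False",
    "y2": "False",
    "y3": "52402292",
    "y4": "d90fce80-b352-4aca-8d83-6e997e48d917",
}
raw_map = "\x02".join(f"{k}\x03{v}" for k, v in entries.items())

# With the control characters rendered invisibly by a terminal or viewer,
# the field reads as the flat string from the question:
print(raw_map.replace("\x02", "").replace("\x03", ""))
# -> y1Falsey2Falsey352402292y4d90fce80-b352-4aca-8d83-6e997e48d917
```

If this matches the file's contents, the data is not "distorted" at all; the control characters are simply invisible when the bytes are displayed as text.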

Answers

  • skeswani (Employee)

    I am not sure what specific documentation you are referring to, and I am unable to reproduce the issue you describe.

    As per this documentation:
    https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/VerticaFunctions/FlexTables/mapLookup.htm

    is this what you are trying to do?

    1. Loading from a local file:
    dbadmin=> CREATE table coljson(id IDENTITY(1,1), json varchar(128), vmap long varbinary(10000));
    CREATE TABLE
    dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) FROM '/home/dbadmin/file.json';

     Rows Loaded
    -------------
               1
    (1 row)

    2. There is nothing special about loading from Hadoop or from a file:

    dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) SOURCE Hdfs(url='https://ip-10-11-12-40.us-east-2.compute.internal:50470/webhdfs/v1/user/user/file.json') ;
    WARNING 7768: HDFS Connector is deprecated. Use 'COPY FROM' syntax directly with 'hdfs://' scheme URLs

     Rows Loaded
    -------------
               1
    (1 row)

    dbadmin=> select id,json from coljson ;
    id | json
    --------+------------------------------------------------------------------------------------------
    1 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
    250001 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
    (2 rows)

    Now you can parse the JSON out into separate columns:

    dbadmin=> select MapLookup(vmap, 'y1' USING PARAMETERS buffer_size=0) y1, MapLookup(vmap, 'y2' USING PARAMETERS buffer_size=0) y2 ,MapLookup(vmap, 'y3' USING PARAMETERS buffer_size=0) y3 ,MapLookup(vmap, 'y4' USING PARAMETERS buffer_size=0) y4 from coljson ;
    y1 | y2 | y3 | y4
    -------+-------+----------+--------------------------------------
    False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
    False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
    (2 rows)


    There are various functions to materialize the individual columns from the JSON, should you want to persist them independently of the JSON string.

  • skeswani (Employee)

    I am unable to reproduce the problem you mention here. Can you point me to the documentation you are referring to?

    cat /tmp/file.json
    {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}

    1. Loading from a local file:
    dbadmin=> CREATE table coljson(id IDENTITY(1,1), json varchar(128), vmap long varbinary(10000));
    CREATE TABLE
    dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) FROM '/home/dbadmin/file.json';
    Rows Loaded

    dbadmin=> select id,json from coljson ;
    id | json
    ----+------------------------------------------------------------------------------------------
    1 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
    (1 row)

    2. There is nothing special about loading from Hadoop or from a file:

    dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) SOURCE Hdfs(url='https://ip-10-11-12-40.us-east-2.compute.internal:50470/webhdfs/v1/user/user/file.json') ;
    WARNING 7768: HDFS Connector is deprecated. Use 'COPY FROM' syntax directly with 'hdfs://' scheme URLs
    Rows Loaded

    dbadmin=> select id, json from coljson ;
    id | json
    --------+------------------------------------------------------------------------------------------
    1 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
    250001 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
    (2 rows)

    For simplicity I will use a file, but an HDFS source is no different.

    One way to do it would be the following:

    dbadmin=> select MapLookup(vmap, 'y1' USING PARAMETERS buffer_size=0) y1, MapLookup(vmap, 'y2' USING PARAMETERS buffer_size=0) y2 ,MapLookup(vmap, 'y3' USING PARAMETERS buffer_size=0) y3 ,MapLookup(vmap, 'y4' USING PARAMETERS buffer_size=0) y4 from coljson ;
    y1 | y2 | y3 | y4
    -------+-------+----------+--------------------------------------
    False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
    False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
    (2 rows)

    Another way would be:

    dbadmin=> create table X (y1 varchar(200), y2 varchar(200), y3 varchar(300) );
    CREATE TABLE

    dbadmin=> COPY X (json filler varchar (64000), y1 AS cast (MapLookup(MapJSONExtractor(json), 'y1' USING PARAMETERS buffer_size=0) as varchar(200)), y2 AS cast (MapLookup(MapJSONExtractor(json), 'y2' ) as varchar(200)), y3 AS cast (MapLookup(MapJSONExtractor(json), 'y3' ) as varchar(200))) from '/tmp/file.json';

    skeswani=> select * from X;
    y1 | y2 | y3
    -------+-------+----------
    False | False | 52402292
    (1 row)


  • kguan (Employee)

    anand_soni, is this the document you mentioned in your description? https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/HCatalogConnector/HCatalogDataTypeConversions.htm

    This is actually the result of using the HCatalog Connector to access data in a Hive database.

    Below is another way to use the HCatalog Connector, in addition to skeswani's solution:
    1. Suppose you have a table map_t in the Hive default database:
    hive> select * from default.map_t;
    OK
    1 {"y1":"False"}
    2 {"y2":"False"}
    3 {"y3":"52402292"}
    4 {"y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
    Time taken: 0.501 seconds, Fetched: 4 row(s)

    2. Then you can create a Vertica HCatalog schema named "hcat_map" for the Hive database "default":
      => create hcatalog schema hcat_map with hcatalog_schema = 'default';
      CREATE SCHEMA

    3. Then you can query the table from Vertica directly:
      => select * from hcat_map.map_t;
      test_id | metrics
      ---------+-----------------------------------------------
      2 | {"y2":"False"}
      3 | {"y3":"52402292"}
      4 | {"y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
      1 | {"y1":"False"}
      (4 rows)

    Note that you need to have WebHCat (a Hive service) installed to support the HCatalog Connector.

  • anand_soni (original poster)

    Thanks @skeswani and @kguan !
    I figured out the actual problem now. The problem is not with the Vertica COPY command. I am reading the data directly from the data file present in the WebHDFS location, and the file doesn't contain the data in JSON format; it holds the unformatted string y1Falsey2Falsey352402292y4d90fce80-b352-4aca-8d83-6e997e48d917 (as I mentioned in the question). So Hive must apply some internal logic to render this data as JSON when presenting the table: running a SELECT query in Hive prints the same data from the same location in JSON format, {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}.

    Is there a way to process this raw data into Hive's JSON format before running the Vertica COPY command? That would solve my problem.
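    If the raw file uses Hive's default LazySimpleSerDe delimiters (\x02 between map entries, \x03 between a key and its value), one option is to preprocess the field into a JSON string before the COPY. A minimal Python sketch, assuming the whole field is a single MAP value; the function name and the raw sample below are hypothetical illustrations, not part of the actual pipeline:

    ```python
    import json

    def hive_map_to_json(raw_field: str) -> str:
        """Convert a raw Hive MAP field (default LazySimpleSerDe delimiters:
        \x02 between entries, \x03 between key and value) to a JSON string."""
        if not raw_field:
            return "{}"
        pairs = (entry.split("\x03", 1) for entry in raw_field.split("\x02"))
        return json.dumps({k: v for k, v in pairs}, separators=(",", ":"))

    # Hypothetical raw field, as Hive would store the example MAP on disk:
    raw = ("y1\x03False\x02y2\x03False\x02y3\x0352402292"
           "\x02y4\x03d90fce80-b352-4aca-8d83-6e997e48d917")
    print(hive_map_to_json(raw))
    # -> {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
    ```

    A real table would first need the line split on \x01 to isolate the MAP column; this sketch only handles the MAP field itself.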

  • kguan (Employee)

    anand_soni, I'm glad you found the root cause of the problem. The Hive raw file is for Hive's own use and usually contains only part of the table, so it is not supposed to be accessed (COPYed) directly by Vertica.

    If you want a Vertica copy of the Hive table, you can use SYNC_WITH_HCATALOG_SCHEMA or SYNC_WITH_HCATALOG_SCHEMA_TABLE:
    1. Create a Vertica schema:
    => create schema local_schema;
    CREATE SCHEMA
    2. Sync the whole Hive database to Vertica:
    => select sync_with_hcatalog_schema('local_schema', 'hcat_map');
    Schema local_schema synchronized with hcat_map
    tables in hcat_map = 7
    tables altered in default_schema = 0
    tables created in default_schema = 7
    stale tables in default_schema = 0
    table changes erred in default_schema = 0
    3. Or sync a single table:
    => select sync_with_hcatalog_schema_table('local_schema', 'hcat_map', 'map_t' );
    Schema local_schema synchronized with hcat_map for table map_t
    table map_t is created in schema local_schema

    Then you can query the table in a local Vertica schema:
    => select * from local_schema.map_t;
    3 | {"y3":"52402292"}
    4 | {"y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
    1 | {"y1":"False"}
    2 | {"y2":"False"}

    Hope this helps resolve your problem.

