Moving data of type MAP (JSON data) from Hive table to Vertica distorts the MAP data?
I have a table in Hive which has a column of type MAP<String, String> which is basically json data. Eg. : {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
As per the Vertica official documentation MAP data in Hive converts to VARCHAR in Vertica. So, I have a corresponding table in Vertica with column type VARCHAR. Now, I am trying to copy this data from Hive table to Vertica table issuing the Vertica COPY command. All data gets copied successfully but the JSON data from hive is copied distorted like this -
y1Falsey2Falsey352402292y4d90fce80-b352-4aca-8d83-6e997e48d917 . As seen here, the JSON format is lost completely. I get a plain string instead in the VARCHAR column.
The command that I issue is -
COPY vertica_table SOURCE Hdfs(url='http:url:50070/webhdfs/v1/user/warehouse/hive_db.db/hive_table_name/part=20170530/*') DELIMITER E'\001' NULL AS '' NO ESCAPE REJECTMAX 10000 DIRECT;
Can someone help me understand what is going wrong during copy? Did I miss something here? Thanks for help.
Answers
I am not sure what specific documentation you are referring to.
Also I am unable to simulate the issue you are referring to.
as per this documentation
https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/VerticaFunctions/FlexTables/mapLookup.htm
is this what you are trying to do
`
1.
dbadmin=> CREATE table coljson(id IDENTITY(1,1), json varchar(128), vmap long varbinary(10000));
CREATE TABLE
dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) FROM '/home/dbadmin/file.json';
Rows Loaded
(1 row)
dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) SOURCE Hdfs(url='https://ip-10-11-12-40.us-east-2.compute.internal:50470/webhdfs/v1/user/user/file.json') ;
WARNING 7768: HDFS Connector is deprecated. Use 'COPY FROM' syntax directly with 'hdfs://' scheme URLs
Rows Loaded
(1 row)
dbadmin=> select id,json from coljson ;
id | json
--------+------------------------------------------------------------------------------------------
1 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
250001 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
(2 rows)
Now you can parse out the json into various columns...
dbadmin=> select MapLookup(vmap, 'y1' USING PARAMETERS buffer_size=0) y1, MapLookup(vmap, 'y2' USING PARAMETERS buffer_size=0) y2 ,MapLookup(vmap, 'y3' USING PARAMETERS buffer_size=0) y3 ,MapLookup(vmap, 'y4' USING PARAMETERS buffer_size=0) y4 from coljson ;
y1 | y2 | y3 | y4
-------+-------+----------+--------------------------------------
False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
(2 rows)
`
There are various functions to materialize the individual cols from the json, should you want to persist them independently of the json string
I am unable to reproduce the problem you mention here. can you point me to the documentation you are referring to.
'
cat /tmp/file.json
{"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
1.
dbadmin=> CREATE table coljson(id IDENTITY(1,1), json varchar(128), vmap long varbinary(10000));
CREATE TABLE
dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) FROM '/home/dbadmin/file.json';
Rows Loaded
dbadmin=> select id,json from coljson ;
id | json
----+------------------------------------------------------------------------------------------
1 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
(1 row)
dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) SOURCE Hdfs(url='https://ip-10-11-12-40.us-east-2.compute.internal:50470/webhdfs/v1/user/user/file.json') ;
WARNING 7768: HDFS Connector is deprecated. Use 'COPY FROM' syntax directly with 'hdfs://' scheme URLs
Rows Loaded
dbadmin=> select id, json from coljson ;
id | json
--------+------------------------------------------------------------------------------------------
1 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
250001 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
(2 rows)
For simplicity i will use a file, but hdfs source is no different
one way to do it would be the following
dbadmin=> select MapLookup(vmap, 'y1' USING PARAMETERS buffer_size=0) y1, MapLookup(vmap, 'y2' USING PARAMETERS buffer_size=0) y2 ,MapLookup(vmap, 'y3' USING PARAMETERS buffer_size=0) y3 ,MapLookup(vmap, 'y4' USING PARAMETERS buffer_size=0) y4 from coljson ;
y1 | y2 | y3 | y4
-------+-------+----------+--------------------------------------
False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
(2 rows)
another way would be
dbadmin=> create table X (y1 varchar(200), y2 varchar(200), y3 varchar(300) );
CREATE TABLE
dbadmin=> COPY X (json filler varchar (64000), y1 AS cast (MapLookup(MapJSONExtractor(json), 'y1' USING PARAMETERS buffer_size=0) as varchar(200)), y2 AS cast (MapLookup(MapJSONExtractor(json), 'y2' ) as varchar(200)), y3 AS cast (MapLookup(MapJSONExtractor(json), 'y3' ) as varchar(200))) from '/tmp/file.json';
skeswani=> select * from X;
y1 | y2 | y3
-------+-------+----------
False | False | 52402292
(1 row)
`
anand_soni, is this the document you mentioned in your description?https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/HadoopIntegrationGuide/HCatalogConnector/HCatalogDataTypeConversions.htm
This is actually the result of using HCatalog Connector to access data in Hive database.
Below is another way to use HCatalog Connector in addition to skeswani's solution:
1. suppose you have table map_t in Hive default database :
hive> select * from default.map_t;
OK
1 {"y1":"False"}
2 {"y2":"False"}
3 {"y3":"52402292"}
4 {"y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
Time taken: 0.501 seconds, Fetched: 4 row(s)
then you can create a Vertica HCatalog schema named "hcat_map" with Hive database "default"
=> create hcatalog schema hcat_map with hcatalog_schema = 'default';
CREATE SCHEMA
you can query the table inside Vertica directly:
=> select * from hcat_map.map_t;
test_id | metrics
---------+-----------------------------------------------
2 | {"y2":"False"}
3 | {"y3":"52402292"}
4 | {"y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
1 | {"y1":"False"}
(4 rows)
Note that you need to have WebHCat (a Hive Service) installed to support HCatalog Connector.
Thanks @skeswani and @kguan !
I figured out the actual problem now. Actually, the problem is not with vertica COPY command. I am reading data directly using the datafile present in webhdfs location. The file doesn't contain the data in JSON format but has it as the unformatted string : y1Falsey2Falsey352402292y4d90fce80-b352-4aca-8d83-6e997e48d917 (as I mentioned in the question). So, I think Hive has an internal logic to convert this data to JSON format somehow while showing the data in hive table (running SELECT query from hive will print the same data from same location in JSON format : {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}).
Is there a way to process this raw data into Hive JSON format before running the vertica COPY command? That would solve my problem.
anand_soni, I'm glad you found the root cause of the problem. The hive raw file is for Hive only and usually contains only part of the table, so it's not suppose to be directly access (COPY) by Vertica.
If you want to have a Vertica copy of the Hive table, you can use SYNC_WITH_HCATALOG_SCHEMA or SYNC_WITH_HCATALOG_SCHEMA_TABLE:
1. create a Vertica schema:
=> create schema local_schema;
CREATE SCHEMA
2. sync the whole Hive database to Vertica:
=> select sync_with_hcatalog_schema('local_schema', 'hcat_map');
Schema local_schema synchronized with hcat_map
tables in hcat_map = 7
tables altered in default_schema = 0
tables created in default_schema = 7
stale tables in default_schema = 0
table changes erred in default_schema = 0
3. or sync a single table:
=> select sync_with_hcatalog_schema_table('local_schema', 'hcat_map', 'map_t' );
Schema local_schema synchronized with hcat_map for table map_t
table map_t is created in schema local_schema
Then you can query the table in a local Vertica schema:
=> select * from local_schema.map_t;
3 | {"y3":"52402292"}
4 | {"y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
1 | {"y1":"False"}
2 | {"y2":"False"}
Hope this would help to resolve your problem.
I am not sure what specific documentation you are referring to.
Also I am unable to simulate the issue you are referring to.
as per this documentation
https://my.vertica.com/docs/8.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/VerticaFunctions/FlexTables/mapLookup.htm
is this what you are trying to do
``
1.
dbadmin=> CREATE table coljson(id IDENTITY(1,1), json varchar(128), vmap long varbinary(10000));
CREATE TABLE
dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) FROM '/home/dbadmin/file.json';
Rows Loaded
(1 row)
dbadmin=> COPY coljson (json, vmap AS MapJSONExtractor(json)) SOURCE Hdfs(url='https://ip-10-11-12-40.us-east-2.compute.internal:50470/webhdfs/v1/user/user/file.json') ;
WARNING 7768: HDFS Connector is deprecated. Use 'COPY FROM' syntax directly with 'hdfs://' scheme URLs
Rows Loaded
(1 row)
dbadmin=> select id,json from coljson ;
id | json
--------+------------------------------------------------------------------------------------------
1 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
250001 | {"y1":"False","y2":"False","y3":"52402292","y4":"d90fce80-b352-4aca-8d83-6e997e48d917"}
(2 rows)
Now you can parse out the json into various columns...
dbadmin=> select MapLookup(vmap, 'y1' USING PARAMETERS buffer_size=0) y1, MapLookup(vmap, 'y2' USING PARAMETERS buffer_size=0) y2 ,MapLookup(vmap, 'y3' USING PARAMETERS buffer_size=0) y3 ,MapLookup(vmap, 'y4' USING PARAMETERS buffer_size=0) y4 from coljson ;
y1 | y2 | y3 | y4
-------+-------+----------+--------------------------------------
False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
False | False | 52402292 | d90fce80-b352-4aca-8d83-6e997e48d917
(2 rows)
``
There are various functions to materialize the individual cols from the json, should you want to persist them independently of the json string