How to extract elements in repeated field into separate records (rows)?

Hi all!

I have JSON flex table with nested records as array.

Like this:

 

{
"ID":"12345",
"Update":{
"Time":"20150227",
"Type":"input",
"Record":[
{
"PartID":"4",
"PartType":"0"
},
{
"PartID":"7",
"PartType":"1"
}
]
}
}

 

I need to transform it in table like this:

 

 ID  | PartID | PartType
-----+--------+---------
12345 4 0
12345 7 1

In Apache Drill it could easily be done through FLATTEN function. 
See https://cwiki.apache.org/confluence/display/DRILL/FLATTEN+Function for details.

Is that true that Vertica does not have such a great possibility? Or this could be done some other way?

 

 

 

 

 

 

 

 

Comments

  • I've just started to explore the Vertica JSON functionality, but you will most likely need a script to format it in your desired output. You can specify to flatten_arrays=true in the fjsonparser to flatten the submaps:

     

    dbadmin=> CREATE FLEX TABLE ar();
    CREATE TABLE
    dbadmin=> COPY ar FROM '/tmp/ar.json' PARSER fjsonparser(flatten_arrays=true);
    Rows Loaded
    -------------
    1
    (1 row)

    dbadmin=> SELECT compute_flextable_keys_and_build_view('ar');
    compute_flextable_keys_and_build_view
    ------------------------------------------------------------------------------------------
    Please see public.ar_keys for updated keys
    The view public.ar_view is ready for querying
    (1 row)

    dbadmin=> \x
    Expanded display is on.
    dbadmin=> SELECT * FROM ar_view;
    -[ RECORD 1 ]------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    id | 12345
    update.time | 20150227
    update.type | input
    update.record | \001\000\000\000\030\000\000\000\004\000\000\000\024\000\000\000\025\000\000\000\026\000\000\000\027\000\000\0004071\004\000\000\000\024\000\000\000\034\000\000\000&\000\000\000.\000\000\0000.PartID0.PartType1.PartID1.PartType
    update.record.0.partid |
    update.record.0.parttype |
    update.record.1.partid |
    update.record.1.parttype |
    -[ RECORD 2 ]------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    id | 12345
    update.time | 20150227
    update.type | input
    update.record |
    update.record.0.partid | 4
    update.record.0.parttype | 0
    update.record.1.partid | 7
    update.record.1.parttype | 1

    The closest I could get is:

     

    dbadmin=> SELECT id, "update.record.0.partid", "update.record.0.parttype", "update.record.1.partid", "update.record.1.parttype" FROM ar_view;
    id | update.record.0.partid | update.record.0.parttype | update.record.1.partid | update.record.1.parttype
    -------+------------------------+--------------------------+------------------------+--------------------------
    12345 | | | |
    12345 | 4 | 0 | 7 | 1
    (2 rows)

     HTH

  • Hi, norbert-krupa.

     

    Thanks for your reply.

     

    Unfortunately that is definitely what we try to avoid. We need to make selects based on Record fields PartID and PartType. Actually to aggregate them and count - how many pairs of "PartID - PartType" there is in the table. And while Record array splits to separate columns this could not be achived..

     

    Typically we have hundreds of 'Record' in 'Update' array. So, in that case Vertica should create hundreds columns and that is not so useful.

  • I'm going through a similar exercise in which I'm trying to normalize many nested arrays. If I'm successful I'll make sure to share my findings.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file