Please take this survey to help us learn more about how you use third party tools. Your input is greatly appreciated!

Ingest column with N number of array element

Hi Vertica support team,
I have a requirement to ingest data with array elements. Here is the example data with 3 columns, col1, col1 are normal int column but column "mapElementColumn" is having array elements which has sub-ccolumns.
Example Data
[
{
"col1": 34,
"col2": 44,
"mapElementColumn": [
{
"ip": "12.13.8.178",
"dns": null,
"userresourceid": null,
"siteresourceid": 10623
},
{
"ip": "12.13.28.243",
"dns": null,
"userresourceid": null,
"siteresourceid": 10020
}
]
},
{
"col1": 35,
"col2": 54,
"mapElementColumn": [
{
"ip": "11.12.28.243",
"dns": null,
"userresourceid": null,
"siteresourceid": 10020
},
{
"ip": "11.12.28.244",
"dns": null,
"userresourceid": null,
"siteresourceid": 10020
}
]

}

]

My question, is it beneficial to store array column [mapElementColumn] data in varchar and "explode" at query time OR explode at ingest time and store multiple rows [per sub-element]?

I have following query (working) -
SELECT
col1::int
,col1::int
,maplookup(values,'ip')::varchar AS 'ip'
,maplookup(values,'dns')::varchar AS 'dns'
,maplookup(values,'userresourceid')::varchar AS 'userresourceid'
,maplookup(values,'siteresourceid')::varchar AS 'siteresourceid'

FROM (
SELECT col1,col2, mapitems (mapElementColumn)
OVER(PARTITION BY col1, col2)
FROM

<

table name>
) table_alias

The volume of data is 3k/min.
Vertica Analytic Database v9.3.1-0

Answers

  • If the number of elements in an array is small, it is always best to explode at ingest time and store them in their primitive type format.
    Are there any estimates on the average and max number of elements in an array?

  • Hi DeepakMajeti,
    Thanks for reply. We don't have fixed element count but it would be approx. 10.
    1. If we explode at ingest time then we need to explode array column with some sort of "maplookup" queries .
    2. Or, are you suggesting parent/child table relational ingest? If so, as I mentioned we don't have unique identifier in our data. One way to overcome this is to use FLEX table with "identity" column:
    CREATE FLEX TABLE store_staging(id identity);
    But, in both of above cases we need some short of cron job to explode data and ingestsingle table (with duplicates rows /per element count) or parent/child table periodically which would add overhead.
    Is there a way to explode data at ingest time without using FLEX table or cron job (like using --load-spec)?
    Thanks

  • If the number of columns is 3 and two of them being integers, then the duplication should not be a concern. But normalizing the schema as fact/dimension tables is optimal.
    To build a single table with duplication you can do the following
    1) Create an external table with the JSON data as follow
    create external table data(col1 int, col2 int, mapElementColumn long varbinary) as copy from '/scratch_b/dmajeti/temp/example1.json' PARSER fjsonparser (flatten_arrays=false, flatten_maps=false);
    2) CTAS from the external table into the internal table by exploding the arrays
    create table foo as select col1, col2, elements['dns'] as dns, elements['ip'] as ip, elements['siteresourceid'] as siteresourceid, elements['userresourceid'] as userresourceid from (select mapitems(mapElementColumn, col1, col2) OVER(PARTITION BEST) as (keys, elements, col1, col2) from data)q1;
    To build the fact-dimension tables
    1) Use two sequence generators for unique ids. Lets name these myseq1, myseq2. https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Sequences/CreatingandInstantiatingNamedSequences.htm?tocpath=Administrator's Guide|Working with Vertica-Managed Tables|Sequences|Named Sequences|_____1
    2) Load fact table with the two integer columns using the external table approach above with myseq1
    3) Load dimension table with the Array data using the external table + explode approach above with myseq2

  • DeepakMajetiDeepakMajeti Employee
    edited April 2020

    The fact+dimension table is always optimal. A single table with integer duplicates is not bad as well since Vertica encodes and compresses the data.
    You can use external tables instead of flex and then CTAS.
    create external table data(col1 int, col2 int, mapElementColumn long varbinary) as copy from '/data/temp/example1.json' PARSER fjsonparser (flatten_arrays=false, flatten_maps=false);
    create table store as select col1, col2, elements['dns'] as dns, elements['ip'] as ip, elements['siteresourceid'] as siteresourceid, elements['userresourceid'] as userresourceid from (select mapitems(mapElementColumn, col1, col2) OVER(PARTITION BEST) as (keys, elements, col1, col2) from data)q1;

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file

Can't find what you're looking for? Search the Vertica Documentation, Knowledge Base, or Blog for more information.