Ingest a column with N array elements
Hi Vertica support team,
I have a requirement to ingest data with array elements. Here is example data with 3 columns: col1 and col2 are normal int columns, but the column "mapElementColumn" contains array elements that have sub-columns.
Example Data
[
  {
    "col1": 34,
    "col2": 44,
    "mapElementColumn": [
      {
        "ip": "12.13.8.178",
        "dns": null,
        "userresourceid": null,
        "siteresourceid": 10623
      },
      {
        "ip": "12.13.28.243",
        "dns": null,
        "userresourceid": null,
        "siteresourceid": 10020
      }
    ]
  },
  {
    "col1": 35,
    "col2": 54,
    "mapElementColumn": [
      {
        "ip": "11.12.28.243",
        "dns": null,
        "userresourceid": null,
        "siteresourceid": 10020
      },
      {
        "ip": "11.12.28.244",
        "dns": null,
        "userresourceid": null,
        "siteresourceid": 10020
      }
    ]
  }
]
My question: is it beneficial to store the array column [mapElementColumn] data as varchar and "explode" it at query time, OR to explode it at ingest time and store multiple rows (one per sub-element)?
I have the following query (working):
SELECT
    col1::int
    ,col2::int
    ,maplookup(values,'ip')::varchar AS ip
    ,maplookup(values,'dns')::varchar AS dns
    ,maplookup(values,'userresourceid')::varchar AS userresourceid
    ,maplookup(values,'siteresourceid')::varchar AS siteresourceid
FROM (
    SELECT col1, col2, mapitems(mapElementColumn)
    OVER (PARTITION BY col1, col2)
    FROM <table name>
) table_alias
The volume of data is 3k/min.
Vertica Analytic Database v9.3.1-0
Answers
If the number of elements in an array is small, it is always best to explode at ingest time and store the values in their primitive types.
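For instance, the exploded target could be a plain columnar table like this (a sketch only; the table name and column types are assumptions based on your sample data):

CREATE TABLE store_exploded (
    col1           INT,
    col2           INT,
    ip             VARCHAR(45),   -- long enough for IPv4/IPv6 text
    dns            VARCHAR(255),
    userresourceid INT,
    siteresourceid INT
);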
Are there any estimates on the average and max number of elements in an array?
Hi DeepakMajeti,
Thanks for the reply. We don't have a fixed element count, but it would be approximately 10.
1. If we explode at ingest time, then we need to explode the array column with some sort of "maplookup" queries.
2. Or, are you suggesting a parent/child relational table ingest? If so, as I mentioned, we don't have a unique identifier in our data. One way to overcome this is to use a FLEX table with an "identity" column:
CREATE FLEX TABLE store_staging(id identity);
But in both of the above cases we need some sort of cron job to periodically explode the data and ingest it into a single table (with duplicate rows, one per element) or into parent/child tables, which would add overhead.
Is there a way to explode data at ingest time without using a FLEX table or a cron job (like using --load-spec)?
Thanks
If the number of columns is 3 and two of them are integers, then the duplication should not be a concern. But normalizing the schema into fact/dimension tables is optimal.
To build a single table with duplication, you can do the following:
1) Create an external table over the JSON data as follows:
create external table data(col1 int, col2 int, mapElementColumn long varbinary)
as copy from '/scratch_b/dmajeti/temp/example1.json'
parser fjsonparser(flatten_arrays=false, flatten_maps=false);
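To sanity-check the external table, you can render the VMap column as a string with the standard flex helper maptostring (with the example JSON above, each row shows one record's full element array):

select col1, col2, maptostring(mapElementColumn) from data;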
2) CTAS from the external table into the internal table by exploding the arrays
create table foo as
select col1, col2,
       elements['dns'] as dns,
       elements['ip'] as ip,
       elements['siteresourceid'] as siteresourceid,
       elements['userresourceid'] as userresourceid
from (
    select mapitems(mapElementColumn, col1, col2)
    over (partition best) as (keys, elements, col1, col2)
    from data
) q1;
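With the example JSON above, foo should end up with four rows, two per (col1, col2) pair, which you can verify with:

select col1, col2, ip, siteresourceid from foo order by col1, ip;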
To build the fact/dimension tables:
1) Use two sequence generators for unique ids. Let's name these myseq1 and myseq2. See: https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Sequences/CreatingandInstantiatingNamedSequences.htm?tocpath=Administrator's Guide|Working with Vertica-Managed Tables|Sequences|Named Sequences|_____1
2) Load the fact table with the two integer columns using the external-table approach above, with myseq1.
3) Load the dimension table with the array data using the external table + explode approach above, with myseq2.
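A sketch of what steps 1-3 could look like end to end. The table names fact_t and dim_t are hypothetical, and (col1, col2) serves as the join key since the data has no natural unique identifier:

create sequence myseq1;
create sequence myseq2;

-- Fact table: one surrogate id per input record.
create table fact_t (id int, col1 int, col2 int);
insert into fact_t
select nextval('myseq1'), col1, col2 from data;

-- Dimension table: one row per array element.
create table dim_t (id int, col1 int, col2 int,
                    ip varchar, dns varchar,
                    userresourceid varchar, siteresourceid varchar);
insert into dim_t
select nextval('myseq2'), col1, col2,
       elements['ip']::varchar, elements['dns']::varchar,
       elements['userresourceid']::varchar, elements['siteresourceid']::varchar
from (
    select mapitems(mapElementColumn, col1, col2)
    over (partition best) as (keys, elements, col1, col2)
    from data
) q1;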
The fact + dimension design is always optimal. A single table with integer duplicates is not bad either, since Vertica encodes and compresses the data.
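If you want to see what the duplicated integer columns actually cost on disk, one way (assuming the single-table variant above, named foo) is to check the column storage system table:

select column_name, sum(used_bytes) as used_bytes
from v_monitor.column_storage
where anchor_table_name = 'foo'
group by column_name;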
You can use external tables instead of flex and then CTAS.
create external table data(col1 int, col2 int, mapElementColumn long varbinary)
as copy from '/data/temp/example1.json'
parser fjsonparser(flatten_arrays=false, flatten_maps=false);
create table store as
select col1, col2,
       elements['dns'] as dns,
       elements['ip'] as ip,
       elements['siteresourceid'] as siteresourceid,
       elements['userresourceid'] as userresourceid
from (
    select mapitems(mapElementColumn, col1, col2)
    over (partition best) as (keys, elements, col1, col2)
    from data
) q1;
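For the ongoing 3k/min stream, one option is to keep the external table pointed at the landing directory and append with INSERT ... SELECT instead of re-running the CTAS. A sketch, assuming new JSON files arrive at the same path; note that this re-reads whatever the COPY path currently matches, so processed files would still need to be rotated out:

insert into store
select col1, col2,
       elements['dns'] as dns,
       elements['ip'] as ip,
       elements['siteresourceid'] as siteresourceid,
       elements['userresourceid'] as userresourceid
from (
    select mapitems(mapElementColumn, col1, col2)
    over (partition best) as (keys, elements, col1, col2)
    from data
) q1;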