Handling array fields in Vertica

rajatpaliwal86 Vertica Customer
edited October 2019 in General Discussion

I've been struggling to figure out how we can store a field with multiple values. I thought of the three approaches below:
1. Create a hybrid flex table and keep the array-type fields virtual. Vertica implicitly stores an array as a map, so I can use MAPCONTAINSVALUE() to look up a specific value (see the sketch after this list). But I found MAPCONTAINSVALUE() to be extremely slow, and I'm not sure how I can materialize such fields, if that's even possible.
2. Keep such attributes in a separate table, the way SQL normally maintains a one-to-many relationship.
3. Store the values as CSV in one column, but then queries need the LIKE operator instead of =, e.g. WHERE column LIKE '%somevalue%'. This looks bad too.
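A minimal sketch of approach 1, assuming a hypothetical flex table events and file path:

    -- Load JSON whose records contain array-valued fields; fjsonparser
    -- stores each record as a VMap in the __raw__ column.
    CREATE FLEX TABLE events();
    COPY events FROM '/home/events.json' PARSER fjsonparser();

    -- Probe a record's VMap for a given value:
    SELECT COUNT(*) FROM events WHERE MAPCONTAINSVALUE(__raw__, 'somevalue');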

What do you think is the best approach to handle array type fields in Vertica?
Can we materialize array type fields from the flex table?
Any downside of using normalized tables and joins?

Answers

  • mflower Vertica Employee

    You could load the JSON data into a staging table and then process the staging data into multiple normal tables.
    In fact, the staging table could simply be an external table, so you don't actually have to load the JSON data at all.

    For example, the following JSON file contains “store” records with nested “sales” data for the stores:

    • Store 1 has 3 sales records
    • Store 30 has 0 sales records

    [
      {
        "store_key": 1,
        "store_name": "Store1",
        "store_number": 1,
        "store_address": "16 Elm St",
        "store_city": "Concord",
        "number_of_employees": 18,
        "annual_shrinkage": 12576,
        "monthly_rent_cost": 2284,
        "sales": [
          {
            "date_key": 1576,
            "product_key": 2435,
            "employee_key": 2,
            "pos_transaction_number": 4281755,
            "sales_quantity": 6,
            "sales_dollar_amount": 300,
            "transaction_time": "10:03:13",
            "tender_type": "Debit"
          },
          {
            "date_key": 647,
            "product_key": 770,
            "employee_key": 2,
            "pos_transaction_number": 1732705,
            "sales_quantity": 4,
            "sales_dollar_amount": 354,
            "transaction_time": "08:11:19",
            "tender_type": "Debit"
          },
          {
            "date_key": 647,
            "product_key": 158,
            "employee_key": 2,
            "pos_transaction_number": 1732708,
            "sales_quantity": 9,
            "sales_dollar_amount": 121,
            "transaction_time": "08:11:56",
            "tender_type": "Debit"
          }
        ]
      },
      {
        "store_key": 30,
        "store_name": "Store30",
        "store_number": 30,
        "store_address": "378 Alden Ave",
        "store_city": "Evansville",
        "number_of_employees": 11,
        "annual_shrinkage": 20294,
        "monthly_rent_cost": 3927
      }
    ]

    Loading a nested JSON Array

    The JSON data is linked to an intermediate “staging area” table. The relevant records are then extracted from the staging area and loaded into the respective tables.

    Step 1. Set up permanent tables to store the dimension and fact tables:

    This step needs to be performed just once, to create the permanent tables.

    CREATE TABLE store_dim
    (
        store_key int NOT NULL,
        store_name varchar(64),
        store_number int,
        store_address varchar(256),
        store_city varchar(64),
        number_of_employees int,
        annual_shrinkage int,
        foot_traffic int,
        monthly_rent_cost int
    );
    CREATE TABLE store_sales_fact
    (
        store_key int NOT NULL,
        date_key int NOT NULL,
        product_key int NOT NULL,
        employee_key int NOT NULL,
        pos_transaction_number int NOT NULL,
        sales_quantity int,
        sales_dollar_amount int,
        transaction_time time,
        tender_type varchar(8)
    );
    

    Step 2a. Create an external flex temp table as an intermediate staging area for the nested JSON array.

    These steps would need to be performed each time that we receive a new nested JSON file.

    CREATE FLEX EXTERNAL TEMP TABLE store_staging() AS
    COPY FROM '/home/nested.json' 
    PARSER fjsonparser(flatten_maps = false );
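
    Before loading anything, you can optionally sanity-check what the parser produced. MAPTOSTRING() renders the staged VMap as readable JSON (just an inspection step, not part of the load):

    SELECT MAPTOSTRING(__raw__) FROM store_staging LIMIT 1;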
    

    Step 2b. Load the “Store” data from the staging table into the “store_dim” table.

    INSERT INTO store_dim
    SELECT
         store_key::int
        ,store_name::varchar
        ,store_number::int
        ,store_address::varchar
        ,store_city::varchar
        ,number_of_employees::int
        ,annual_shrinkage::int
        ,foot_traffic::int
        ,monthly_rent_cost::int
    FROM store_staging;
    
    COMMIT;
    

    Step 2c. Load the “Store Sales” data from the temporary staging table into the “store_sales_fact” table.

    INSERT INTO store_sales_fact
    SELECT
         store_key::integer
        , maplookup(values,'date_key')::integer
        , maplookup(values,'product_key')::integer
        , maplookup(values,'employee_key')::integer
        , maplookup(values,'pos_transaction_number')::integer
        , maplookup(values,'sales_quantity')::integer
        , maplookup(values,'sales_dollar_amount')::integer
        , maplookup(values,'transaction_time')::time
        , maplookup(values,'tender_type')::varchar
    FROM (
        SELECT store_key, mapitems(sales) OVER (PARTITION BY store_key)
        FROM store_staging
    ) staging;
    
    COMMIT;
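
    If a cast ever surprises you, it can help to inspect the exploded sales elements before casting; each array element arrives as its own VMap in the values column. A sketch of such a check:

    SELECT store_key, MAPTOSTRING(values) AS sale
    FROM (
        SELECT store_key, mapitems(sales) OVER (PARTITION BY store_key)
        FROM store_staging
    ) staging;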
    

    Step 3. Validate that rows have been loaded into the tables. Note that store “1” has 3 sales, and store “30” has zero sales.

    SELECT sd.store_key, COUNT(ssf.store_key)
    FROM store_dim sd
    LEFT JOIN store_sales_fact ssf
        ON sd.store_key = ssf.store_key
    GROUP BY 1;
    
    store_key | count
    -----------+-------
             1 |     3
            30 |     0
    (2 rows)
    
  • rajatpaliwal86 Vertica Customer

    @mflower Thanks for the reply.
    The fields which can have multiple values vary from one event to another. E.g., we'll be storing network-event data from different protocols, so an HTTP event will have its own protocol-specific fields which can have multiple values, and an event from another protocol will have its own. Since we'll have multiple protocols, the fields in a row vary a lot, alongside a few common fields like IP addresses, ports, etc. (around 10-15 common fields). What do you suggest in that case?

  • mflower Vertica Employee

    Optional attributes (that might or might not occur in the source file) can be handled by the fjsonparser and then loaded into the relevant table.
    E.g., I have modified the earlier example to include extra attributes:

    • In Store 1, there are attributes "store_type" and "store_colour"
    • In Store 30, there is an additional attribute "open_all_day"

    [
      {
        "store_key": 1,
        "store_name": "Store1",
        "store_number": 1,
        "store_address": "16 Elm St",
        "store_city": "Concord",
        "number_of_employees": 18,
        "store_type": "Footwear",
        "store_colour": "Red",
        "annual_shrinkage": 12576,
        "monthly_rent_cost": 2284,
        "sales": [
          {
            "date_key": 1576,
            "product_key": 2435,
            "employee_key": 2,
            "pos_transaction_number": 4281755,
            "sales_quantity": 6,
            "sales_dollar_amount": 300,
            "transaction_time": "10:03:13",
            "tender_type": "Debit"
          },
          {
            "date_key": 647,
            "product_key": 770,
            "employee_key": 2,
            "pos_transaction_number": 1732705,
            "sales_quantity": 4,
            "sales_dollar_amount": 354,
            "transaction_time": "08:11:19",
            "tender_type": "Debit"
          },
          {
            "date_key": 647,
            "product_key": 158,
            "employee_key": 2,
            "pos_transaction_number": 1732708,
            "sales_quantity": 9,
            "sales_dollar_amount": 121,
            "transaction_time": "08:11:56",
            "tender_type": "Debit"
          }
        ]
      },
      {
        "store_key": 30,
        "store_name": "Store30",
        "store_number": 30,
        "store_address": "378 Alden Ave",
        "store_city": "Evansville",
        "number_of_employees": 11,
        "open_all_day": 0,
        "annual_shrinkage": 20294,
        "monthly_rent_cost": 3927
      }
    ]

    So we create our store dimension table with all of the extra attributes, including the optional ones.

    CREATE TABLE store_dim
    (
        store_key int NOT NULL,
        store_name varchar(64),
        store_number int,
        store_address varchar(256),
        store_city varchar(64),
        number_of_employees int,
        annual_shrinkage int,
        foot_traffic int,
        monthly_rent_cost int,
        store_type varchar(20),
        store_colour varchar(20),
        open_all_day int
    );
    

    As shown earlier in this thread, load the JSON file into the staging table using the fjsonparser.

    CREATE FLEX EXTERNAL TEMP TABLE store_staging() AS
    COPY FROM '/home/nested.json' 
    PARSER fjsonparser(flatten_maps = false );
    

    We load all the parent staging attributes into the store table:

    INSERT INTO store_dim
    SELECT
         store_key::int
        ,store_name::varchar
        ,store_number::int
        ,store_address::varchar
        ,store_city::varchar
        ,number_of_employees::int
        ,annual_shrinkage::int
        ,foot_traffic::int
        ,monthly_rent_cost::int
        ,store_type::varchar
        ,store_colour::varchar
        ,open_all_day::int
    FROM store_staging;
    

    Verify the results:

    SELECT store_key
          ,store_type
          ,store_colour
          ,open_all_day
    FROM store_dim;

     store_key | store_type | store_colour | open_all_day
    -----------+------------+--------------+--------------
             1 | Footwear   | Red          |
            30 |            |              |            0
    (2 rows)
    
  • rajatpaliwal86 Vertica Customer

    @marcothesane said:
    In that case, I'd model a parent table with all common fields that are not arrays. And I'd give it a primary key if it does not already have one - by using an IDENTITY column, for example.
    For each array, you would have two options:

    • Flatten them. If you have an array of temperatures called temp, determine the biggest possible number of temperature measures, and call the columns temp01 through temp99 or whatever. Do that for each array.
    • Create one child table for each array. For temperatures, name the table temp, with columns parent_id INT (the primary key of the parent table), seq INT (the sequence, or array index), and val FLOAT (the temperature value itself). Do that for each array.

    In a nutshell - let a relational database behave like a relational database. Tables and atomic values, period. Only with such a design can a query optimizer do its work; anything involving comma-separated lists in a string or the like would slow things down considerably.
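
    (A minimal DDL sketch of the per-array child table described in the quote above; the names follow the quote, everything else is illustrative:)

    CREATE TABLE temp (
        parent_id INT NOT NULL,  -- primary key of the parent table
        seq       INT NOT NULL,  -- array index
        val       FLOAT          -- the temperature value itself
    );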

    @marcothesane Can I keep a common table instead of an individual table for each array-type field?
    Like
    parent_id, seq, field1, field2, ....

  • marcothesane Administrator

    Do you mean:
    parent_id | enum_type  | seq | val
    ----------+------------+-----+--------
           42 | instalment |   0 | 125.28
           42 | instalment |   1 | 301.28
           42 | instalment |   2 |  85.25
           42 | instalment |   3 | 102.48
           42 | refund     |   0 |   2.48
           42 | refund     |   1 |  34.37
           42 | refund     |   2 | 102.63
           42 | refund     |   3 | 245.78
           42 | refund     |   4 |  54.42
    ?

    Then I suppose, yes. Or do you mean something else?
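
    For reference, a DDL sketch of that layout (table and column names illustrative):

    CREATE TABLE array_values (
        parent_id INT NOT NULL,  -- primary key of the parent row
        enum_type VARCHAR(32),   -- which array the value belongs to
        seq       INT,           -- position within that array
        val       FLOAT          -- the value itself
    );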

  • rajatpaliwal86 Vertica Customer

    @marcothesane said:
    Do you mean:
    parent_id | enum_type  | seq | val
    ----------+------------+-----+--------
           42 | instalment |   0 | 125.28
           42 | instalment |   1 | 301.28
           42 | instalment |   2 |  85.25
           42 | instalment |   3 | 102.48
           42 | refund     |   0 |   2.48
           42 | refund     |   1 |  34.37
           42 | refund     |   2 | 102.63
           42 | refund     |   3 | 245.78
           42 | refund     |   4 |  54.42
    ?

    Then I suppose, yes. Or do you mean something else?

    No, I meant creating a table like

    create table array_fields
    (
        parent_id INT,
        index INT,
        installment INT,
        refund INT,
        ..
    );

    i.e. a separate column for each enum type in your case.

  • marcothesane Administrator

    Never tried that design, really. It can become awkward if the size of the arrays can change. What do you do if you have, in the example above, 5 refunds and 5 installments, and then installment 3 needs to be removed? Do you set it to NULL in that row, allowing for gaps, or do you "contract" them, so that the previous installment 4 becomes installment 3, the previous installment 5 becomes installment 4, and the installment column for the fifth row is set to NULL?

  • rajatpaliwal86 Vertica Customer

    @marcothesane said:
    Never tried that design, really. It can become awkward if the size of the arrays can change. What do you do if you have, in the example above, 5 refunds and 5 installments, and then installment 3 needs to be removed? Do you set it to NULL in that row, allowing for gaps, or do you "contract" them, so that the previous installment 4 becomes installment 3, the previous installment 5 becomes installment 4, and the installment column for the fifth row is set to NULL?

    Yes, I too realized that this design would not work and queries would not behave as expected. Considering your original suggestion of keeping an individual table for each field - we have around 20 such fields - do you see any problem with Vertica if the number of tables is huge?

  • marcothesane Administrator

    No, I don't think that 20 array tables for one main table should hurt too much. I would not, actually, expect a report to always include all 20 of those tables, so it wouldn't hurt too much at query time, either.

    I would still explore the array-flattening path, though. Of course, with arrays of 20, or even 100 elements, you can't flatten them elegantly into one table. But arrays of 5, I would flatten into:

    CREATE TABLE with_arrays (
      id INT NOT NULL PRIMARY KEY
    , name VARCHAR(32)
    , installment_1 FLOAT
    , installment_2 FLOAT
    , installment_3 FLOAT
    , installment_4 FLOAT
    , installment_5 FLOAT
    , refund_1 FLOAT
    , refund_2 FLOAT
    , refund_3 FLOAT
    , refund_4 FLOAT
    , refund_5 FLOAT
    );
    
  • rajatpaliwal86 Vertica Customer

    @marcothesane said:
    No, I don't think that 20 array tables for one main table should hurt too much. I would not, actually, expect a report to always include all 20 of those tables, so it wouldn't hurt too much at query time, either.

    I would still explore the array-flattening path, though. Of course, with arrays of 20, or even 100 elements, you can't flatten them elegantly into one table. But arrays of 5, I would flatten into:

    CREATE TABLE with_arrays (
      id INT NOT NULL PRIMARY KEY
    , name VARCHAR(32)
    , installment_1 FLOAT
    , installment_2 FLOAT
    , installment_3 FLOAT
    , installment_4 FLOAT
    , installment_5 FLOAT
    , refund_1 FLOAT
    , refund_2 FLOAT
    , refund_3 FLOAT
    , refund_4 FLOAT
    , refund_5 FLOAT
    );
    

    OK, thanks for the reply. The primary table is partitioned by a month ID; shall I add that field to all child tables and include a partition clause in all of them? What are your thoughts?
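
    For illustration, a child table carrying the parent's partition key might look like this (month_id and all other names are hypothetical):

    CREATE TABLE temp (
        parent_id INT NOT NULL,
        month_id  INT NOT NULL,  -- same partition key as the parent table
        seq       INT,
        val       FLOAT
    )
    PARTITION BY month_id;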
