
Handling array fields in Vertica

I've been struggling to figure out how we can store a field with multiple values. I thought of the three approaches below:
1. Create a hybrid flex table and keep the array-type fields virtual. Vertica implicitly stores array types as a map, so I can use MAPCONTAINSVALUE() when looking up a specific value. But I found MAPCONTAINSVALUE() to be extremely slow, and I'm not sure how I can materialize such fields, if that's even possible.
2. Keep such attributes in a separate table; that's how SQL normally maintains a one-to-many relationship.
3. Store the values as CSV in a single column, but then queries have to use the LIKE operator instead of =, e.g. WHERE column LIKE '%somevalue%'. This looks bad too.

What do you think is the best approach to handling array-type fields in Vertica?
Can we materialize array-type fields from a flex table?
Are there any downsides to using normalized tables and joins?
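
A note on the materialization question: Vertica does provide helper functions that promote frequently used scalar virtual columns of a flex table into real columns (a map-typed field itself stays in the __raw__ column). A minimal, untested sketch, assuming a hypothetical flex table named events:

```sql
-- Sketch (hypothetical table name): promote common virtual columns.
SELECT COMPUTE_FLEXTABLE_KEYS('events');           -- rebuild the keys catalog
SELECT MATERIALIZE_FLEXTABLE_COLUMNS('events', 4); -- promote up to 4 columns
```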

Answers

  • mflower (Employee)

    You could load the JSON data into a staging table and then process the staging data into multiple normal tables.
    In fact, the staging table, could simply be an External table, so you don't actually have to load the JSON data at all.

    eg.
    The following example shows a JSON file of “store” records with nested “sales” data for the stores.

    • Store 1 has 3 sales records
    • Store 30 has 0 sales records

    [
      {
        "store_key": 1,
        "store_name": "Store1",
        "store_number": 1,
        "store_address": "16 Elm St",
        "store_city": "Concord",
        "number_of_employees": 18,
        "annual_shrinkage": 12576,
        "monthly_rent_cost": 2284,
        "sales": [
          {
            "date_key": 1576,
            "product_key": 2435,
            "employee_key": 2,
            "pos_transaction_number": 4281755,
            "sales_quantity": 6,
            "sales_dollar_amount": 300,
            "transaction_time": "10:03:13",
            "tender_type": "Debit"
          },
          {
            "date_key": 647,
            "product_key": 770,
            "employee_key": 2,
            "pos_transaction_number": 1732705,
            "sales_quantity": 4,
            "sales_dollar_amount": 354,
            "transaction_time": "08:11:19",
            "tender_type": "Debit"
          },
          {
            "date_key": 647,
            "product_key": 158,
            "employee_key": 2,
            "pos_transaction_number": 1732708,
            "sales_quantity": 9,
            "sales_dollar_amount": 121,
            "transaction_time": "08:11:56",
            "tender_type": "Debit"
          }
        ]
      },
      {
        "store_key": 30,
        "store_name": "Store30",
        "store_number": 30,
        "store_address": "378 Alden Ave",
        "store_city": "Evansville",
        "number_of_employees": 11,
        "annual_shrinkage": 20294,
        "monthly_rent_cost": 3927
      }
    ]

    Loading a nested JSON Array

    The JSON data is linked to an intermediate “staging area” table. The relevant records are then extracted from the staging area and loaded into the respective tables.

    Step 1. Set up permanent tables to store the dimension and fact tables:

    This step needs to be performed just once, to create the permanent tables.

    CREATE TABLE store_dim
    (
        store_key int NOT NULL,
        store_name varchar(64),
        store_number int,
        store_address varchar(256),
        store_city varchar(64),
        number_of_employees int,
        annual_shrinkage int,
        foot_traffic int,
        monthly_rent_cost int
    );
    CREATE TABLE store_sales_fact
    (
        store_key int NOT NULL,
        date_key int NOT NULL,
        product_key int NOT NULL,
        employee_key int NOT NULL,
        pos_transaction_number int NOT NULL,
        sales_quantity int,
        sales_dollar_amount int,
        transaction_time time,
        tender_type varchar(8)
    );
    

    Step 2a. Create an external flex temp table as an intermediate staging area for the nested JSON array

    These steps would need to be performed each time that we receive a new nested JSON file.

    CREATE FLEX EXTERNAL TEMP TABLE store_staging() AS
    COPY FROM '/home/nested.json' 
    PARSER fjsonparser(flatten_maps = false );
    

    Step 2b. Load the “Store” data from the staging table into the “store_dim” table.

    INSERT INTO store_dim
    SELECT
         store_key::int
        ,store_name::varchar
        ,store_number::int
        ,store_address::varchar
        ,store_city::varchar
        ,number_of_employees::int
        ,annual_shrinkage::int
        ,foot_traffic::int
        ,monthly_rent_cost::int
    FROM store_staging;
    
    COMMIT;
    

    Step 2c. Load the “Store Sales” data from the temporary staging table into the “store_sales_fact” table.

    INSERT INTO store_sales_fact
    SELECT
         store_key::integer
        , maplookup(values,'date_key')::integer
        , maplookup(values,'product_key')::integer
        , maplookup(values,'employee_key')::integer
        , maplookup(values,'pos_transaction_number')::integer
        , maplookup(values,'sales_quantity')::integer
        , maplookup(values,'sales_dollar_amount')::integer
        , maplookup(values,'transaction_time')::time
        , maplookup(values,'tender_type')::varchar
    FROM (
        SELECT store_key, mapitems(sales)
               OVER (PARTITION BY store_key)
        FROM store_staging
    ) staging;
    
    COMMIT;
    

    Step 3. Validate that rows have been loaded into the tables. Note that store “1” has 3 sales, and store “30” has zero sales.

    SELECT sd.store_key, COUNT(ssf.store_key)
    FROM store_dim sd
    LEFT JOIN store_sales_fact ssf
      ON sd.store_key = ssf.store_key
    GROUP BY 1;
    
     store_key | count
    -----------+-------
             1 |     3
            30 |     0
    (2 rows)
    
  • @mflower Thanks for the reply.
    The fields that can have multiple values vary from one event to another. E.g., we'll be storing network-event data from different protocols, so an HTTP event will have its own protocol-specific fields that can hold multiple values, and an event from another protocol will have its own. Since we'll have multiple protocols, the fields in a row vary a lot, apart from a few common fields like IP addresses, ports, etc. (around 10-15 common fields). What do you suggest in that case?

  • mflower (Employee)

    Optional attributes (which might or might not occur in the source file) can be handled by the fjsonparser, and then loaded into the relevant table.
    E.g., I have modified the earlier example to include extra attributes:

    • In Store 1, there are attributes "store_type" and "store_colour"
    • In Store 30, there is an additional attribute "open_all_day"

    [
      {
        "store_key": 1,
        "store_name": "Store1",
        "store_number": 1,
        "store_address": "16 Elm St",
        "store_city": "Concord",
        "number_of_employees": 18,
        "store_type": "Footwear",
        "store_colour": "Red",
        "annual_shrinkage": 12576,
        "monthly_rent_cost": 2284,
        "sales": [
          {
            "date_key": 1576,
            "product_key": 2435,
            "employee_key": 2,
            "pos_transaction_number": 4281755,
            "sales_quantity": 6,
            "sales_dollar_amount": 300,
            "transaction_time": "10:03:13",
            "tender_type": "Debit"
          },
          {
            "date_key": 647,
            "product_key": 770,
            "employee_key": 2,
            "pos_transaction_number": 1732705,
            "sales_quantity": 4,
            "sales_dollar_amount": 354,
            "transaction_time": "08:11:19",
            "tender_type": "Debit"
          },
          {
            "date_key": 647,
            "product_key": 158,
            "employee_key": 2,
            "pos_transaction_number": 1732708,
            "sales_quantity": 9,
            "sales_dollar_amount": 121,
            "transaction_time": "08:11:56",
            "tender_type": "Debit"
          }
        ]
      },
      {
        "store_key": 30,
        "store_name": "Store30",
        "store_number": 30,
        "store_address": "378 Alden Ave",
        "store_city": "Evansville",
        "number_of_employees": 11,
        "open_all_day": 0,
        "annual_shrinkage": 20294,
        "monthly_rent_cost": 3927
      }
    ]

    So we create our store dimension table with all of the extra attributes, including the optional ones.

    CREATE TABLE store_dim
    (
        store_key int NOT NULL,
        store_name varchar(64),
        store_number int,
        store_address varchar(256),
        store_city varchar(64),
        number_of_employees int,
        annual_shrinkage int,
        foot_traffic int,
        monthly_rent_cost int,
        store_type varchar(20),
        store_colour varchar(20),
        open_all_day int
    );
    

    As shown earlier in this thread, load the JSON file into the staging table, using the FJSONParser.

    CREATE FLEX EXTERNAL TEMP TABLE store_staging() AS
    COPY FROM '/home/nested.json' 
    PARSER fjsonparser(flatten_maps = false );
    

    We load all the parent staging attributes into the store table:

    INSERT INTO store_dim
    SELECT
         store_key::int
        ,store_name::varchar
        ,store_number::int
        ,store_address::varchar
        ,store_city::varchar
        ,number_of_employees::int
        ,annual_shrinkage::int
        ,foot_traffic::int
        ,monthly_rent_cost::int
        ,store_type::varchar
        ,store_colour::varchar
        ,open_all_day::int
    FROM store_staging;
    

    Verify the results:

    SELECT store_key
          ,store_type
          ,store_colour
          ,open_all_day
    FROM store_dim;
    
     store_key | store_type | store_colour | open_all_day
    -----------+------------+--------------+--------------
             1 | Footwear   | Red          |
            30 |            |              |            0
    (2 rows)
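
    For the network-events case raised above (varying protocol-specific multi-value fields plus 10-15 common fields), one more option worth sketching is a hybrid flex table: the common fields become real, typed columns, while everything else lands in the __raw__ map and remains queryable by name. The table and column names below are hypothetical, and this is an untested sketch rather than a definitive implementation.

```sql
-- Sketch (hypothetical names): hybrid flex table for heterogeneous events.
-- Real columns hold the common fields; any other key in the incoming JSON
-- lands in the __raw__ map and is exposed as a virtual column.
CREATE FLEX TABLE network_events (
    event_time timestamp,
    protocol   varchar(16),
    src_ip     varchar(45),
    dst_ip     varchar(45),
    src_port   int,
    dst_port   int
);

COPY network_events FROM '/home/events.json'
PARSER fjsonparser();

-- A protocol-specific virtual column is referenced like an ordinary column:
-- SELECT src_ip, http_user_agent FROM network_events WHERE protocol = 'HTTP';
```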
    
