Handling array fields in Vertica
I've been struggling to figure out how we can store a field with multiple values. I've considered the three approaches below:
1. Create a hybrid flex table, keeping the array-type fields as virtual columns. Vertica implicitly stores array types as a map, so I can use MAPCONTAINSVALUE() when looking up a specific value. But I found MAPCONTAINSVALUE() to be extremely slow, and I'm not sure how to materialize such fields, if that's even possible.
2. Keep such attributes in a separate table, the way SQL normally models a one-to-many relationship.
3. Store the values as a comma-separated string, but then queries have to use the LIKE operator instead of =, e.g. WHERE column LIKE '%somevalue%'. This looks bad too.
What do you think is the best approach to handle array type fields in Vertica?
Can we materialize array type fields from the flex table?
Any downside of using normalized tables and joins?
marcothesane - Administrator
In that case, I'd model a parent table with all common fields that are not arrays. And I'd give it a primary key if it does not already have one - by using an IDENTITY column, for example.
For each array, you would have two options:
- Flatten them. If you have an array of temperatures called temp, determine the biggest possible number of temperature measures, and call the columns temp01 through temp99, or whatever. Do that for each array.
- Create one child table for each array (sketched below). For temperatures, name the table temp, with columns parent_id INT (the primary key of the parent table), seq INT (the sequence, or array index), and val FLOAT (the temperature value itself). Do that for each array.
In a nutshell - let a relational database behave like a relational database. Tables and atomic values, period. Only with such a design can a query optimizer do its work; anything involving comma-separated lists in a string would slow things down considerably.
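A minimal sketch of the child-table design (the table and column names here are placeholders, not from the thread):

-- Parent table with the common, non-array fields and a surrogate key
create table parent (
    id       IDENTITY(1,1) PRIMARY KEY,
    event_ts TIMESTAMP,        -- stand-in for the common fields
    source   VARCHAR(50)       -- stand-in for the common fields
);

-- One child table per array, here for the temperatures
create table temp (
    parent_id INT NOT NULL,    -- the primary key of the parent table
    seq       INT NOT NULL,    -- the sequence, or array index
    val       FLOAT            -- the temperature value itself
);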
You could load the JSON data into a staging table and then process the staging data into multiple normal tables.
In fact, the staging table could simply be an external table, so you don't actually have to load the JSON data at all.
The following example shows a JSON file of “store” records with nested “sales” data for the stores.
Loading a nested JSON Array
The JSON data is linked to an intermediate “staging area” table. The relevant records are then extracted from the staging area and loaded into the respective tables.
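For concreteness, assume the input file looks something like this (the attribute names are assumptions, since the original attachment isn't reproduced here); note that store 1 has three sales and store 30 has none, matching the validation in step 3:

{"store_id": 1, "store_name": "Store One", "sales": [
    {"sale_id": 101, "amount": 19.99},
    {"sale_id": 102, "amount": 5.49},
    {"sale_id": 103, "amount": 12.00}
]}
{"store_id": 30, "store_name": "Store Thirty", "sales": []}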
Step 1. Set up permanent tables to store the dimension and fact tables:
This step needs to be performed just once, to create the permanent tables.
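A sketch, using the assumed attribute names from the sample above:

create table store_dim (
    store_id   INT PRIMARY KEY,
    store_name VARCHAR(100)
);

create table store_sales_fact (
    store_id INT,               -- references store_dim.store_id
    sale_id  INT,
    amount   NUMERIC(10,2)
);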
Step 2a. Create an external flex table as an intermediate staging area for the nested JSON array.
These steps would need to be performed each time that we receive a new nested JSON file.
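For example (the file path is an assumption; flatten_arrays=true makes the nested sales elements addressable as keys like sales.0.sale_id):

create flex external table store_stage()
as copy from '/data/stores.json'
parser fjsonparser(flatten_maps=true, flatten_arrays=true);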
Step 2b. Load the “Store” data from the staging table into the “store_dim” table.
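With the staging table above, the parent attributes sit at the top level of the flex map, so the load is a plain INSERT...SELECT with casts:

insert into store_dim (store_id, store_name)
select store_id::INT, store_name::VARCHAR(100)
from store_stage;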
Step 2c. Load the “Store Sales” data from the temporary staging table into the “store_sales_fact” table.
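One way to unnest the flattened sales keys is to explode the flex map with MAPITEMS(), keep the sales.* entries, and pivot each (store, array index) pair back into one row - a sketch, assuming the staging definition above:

insert into store_sales_fact (store_id, sale_id, amount)
select store_id::INT,
       max(decode(attr, 'sale_id', val))::INT           as sale_id,
       max(decode(attr, 'amount',  val))::NUMERIC(10,2) as amount
from (
    select store_id,
           split_part(keys, '.', 2)::INT as idx,   -- the array index
           split_part(keys, '.', 3)      as attr,  -- sale_id or amount
           "values"                      as val
    from (
        select store_id, mapitems(__raw__) over (partition by store_id)
        from store_stage
    ) t
    where keys like 'sales.%'
) s
group by store_id, idx;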
Step 3. Validate that rows have been loaded into the tables. Note that store “1” has 3 sales, and store “30” has zero sales.
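For example:

select d.store_id, d.store_name, count(f.sale_id) as sales
from store_dim d
left join store_sales_fact f using (store_id)
group by d.store_id, d.store_name
order by d.store_id;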
@mflower Thanks for the reply.
The fields that can have multiple values vary from one event to another. E.g., we'll be storing network-event data from different protocols, so an HTTP event will have its own protocol-specific fields that can have multiple values, and an event from another protocol will have its own. Since we'll have multiple protocols, the fields in a row vary a lot, alongside a few common fields like IP addresses, ports, etc. (around 10-15 common fields). What do you suggest in that case?
Optional attributes (which might or might not occur in the source file) can be handled by the FJSONParser and then loaded into the relevant table.
E.g., I have modified the earlier example to include extra attributes:
So we create our store dimension table with all of the extra attributes, including the optional ones.
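A sketch, with region and phone standing in for the optional attributes (the real names aren't shown in the thread):

drop table if exists store_dim;
create table store_dim (
    store_id   INT PRIMARY KEY,
    store_name VARCHAR(100),
    region     VARCHAR(50),    -- optional in the source file
    phone      VARCHAR(20)     -- optional in the source file
);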
As shown earlier in this thread, load the JSON file into the staging table, using the FJSONParser.
We load all the parent staging attributes into the store table:
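Keys that are absent from a given JSON record simply come back as NULL from the flex map, so optional attributes need no special handling:

insert into store_dim (store_id, store_name, region, phone)
select store_id::INT,
       store_name::VARCHAR(100),
       region::VARCHAR(50),    -- NULL when the key is absent
       phone::VARCHAR(20)      -- NULL when the key is absent
from store_stage;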
Verify the results:
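For instance:

select * from store_dim order by store_id;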
@marcothesane Can I keep a common table instead of an individual table for each array-type field? Something like:
parent_id, seq, field1, field2,....
Do you mean:
parent_id | field      | seq | val
       42 | instalment |   0 | 125.28
       42 | instalment |   1 | 301.28
       42 | instalment |   2 |  85.25
       42 | instalment |   3 | 102.48
       42 | refund     |   0 |   2.48
       42 | refund     |   1 |  34.37
       42 | refund     |   2 | 102.63
       42 | refund     |   3 | 245.78
       42 | refund     |   4 |  54.42
Then I suppose, yes. Or do you mean something else?
No, I meant creating a table like

create table array_fields (parent_id, seq, field1, field2, ...)

i.e. one wide table with a separate column for each array field - one column per enum value in your example - instead of one child table per array.
Never tried that design, really. It can become awkward if the size of the arrays can change. What do you do if you have, in the example above, 5 refunds and 5 instalments, and then instalment 3 needs to be removed? Do you set it to NULL in that row, allowing for gaps, or do you "contract" them, so that the previous instalment 4 becomes instalment 3, the previous instalment 5 becomes instalment 4, and the instalment column in the fifth row is set to NULL?
Yes, I too realized that this design would not work and queries would not behave as expected. Coming back to your original suggestion of keeping an individual table for each field - we have around 20 such fields - do you see any problem with Vertica if the number of tables gets that large?
No, I don't think that 20 array tables for one main table should hurt too much. I would not, actually, expect a report to always include all 20 of those tables, so it wouldn't hurt too much at query time, either.
I would still explore the array-flattening path, though. Of course, with arrays of 20, or even 100 elements, you can't flatten them elegantly into one table. But arrays of 5, I would flatten into:
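For example, a five-element temperature array would become five columns on the parent itself (a sketch; names assumed):

create table parent_flat (
    id    IDENTITY(1,1) PRIMARY KEY,
    -- the common fields go here, then one column per array slot:
    temp1 FLOAT,
    temp2 FLOAT,
    temp3 FLOAT,
    temp4 FLOAT,
    temp5 FLOAT
);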
OK, thanks for the reply. The primary table is partitioned by a month ID; should I add that field to all child tables and include a partition clause on all of them as well? What are your thoughts?
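For reference, carrying the partition key down into a child table might look like this (month_id and the other names are assumptions):

create table temp (
    parent_id INT NOT NULL,
    month_id  INT NOT NULL,   -- copied from the parent row
    seq       INT NOT NULL,
    val       FLOAT
)
partition by month_id;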