Data load performance vs query performance for Flex tables

Vadim_Levin · March 2014

We are trying to evaluate Flex table functionality for our game data analytics pipeline. Input JSON data change frequently, e.g. keys get added/modified on a daily basis, and we need to load the data on a daily basis. Because Flex table compute_keys function does not seem to infer data type of keys correctly and it is expensive, we need to infer "metadata" daily ourselves and then follow two possible scenarios:
1) Infer metadata. Not to use Flex table functionality at all, use conventional Vertica table(s) Generate "ALTER TABLE.." statement based on metadata changes, run the statement, load the data.
2) Use Flex table functionality. Infer metadata and generate "UPDATE/INSERT INTO public.some_table_keys TABLE..." statement Load data into Flex table. Run the statement above. Rebuild Flex table view and tell our customers to use the view. This is, probably, faster than altering table in the first scenario. So far so good. But Flex table documentation insists that for fast querying it is necessary to materialize flex table columns.

So, from data load performance point of view, is there any advantage of using the second scenario: updating flex table view and materializing new columns over using the first scenario of altering conventional table? And if there is no advantage, then is there any point in using Flex tables in our use case?

Thanks.

[Deleted User] · March 2014

Hi Vadim,

Per my response on your previous post, getting types right here is a little bit tricky, in the "there's not just one right answer" sense. You're correct that our default (and, currently, only) implementation is very loose with the types that it chooses.

The trade-off between these two options is the classic trade-off of load-time vs query-time conversion. Do you want to pay a big up-front cost to make subsequent queries faster? That's going to be up to you; I'd recommend loading some data and running some queries, seeing what matches your performance needs.

Personally, I think I would recommend a third option for your use case -- use hybrid flex tables. The documentation discusses these; you can refer to it for details. Hybrid flex tables have both regular typed columns (the documentation discusses how to manage them) and flex columns. If a new column shows up, by default it will be immediately queryable as a VARCHAR (no need to even create the view, though you'll have to know or query its name to find it); but once you are confident of the data type, you can materialize the new column and give it a type.

The advantage here is, it's still a flex table; we can still store (and, with some SQL gymnastics, allow you to query) values that don't match whatever type you specify, should you need it. Also, if your users need immediate access to new values, you can give them that access; they don't have to wait a day for a view or a table to be generated.

Adam

[Deleted User] · March 2014

Incidentally: If you don't like "compute_flextable_keys()" but you want to use the rest of the Flex machinery, you can write your own function and use its output to update the keys table. (I believe you can modify the keys table with UPDATE/etc statements just like any other table.) Then the rest of the Flex machinery will work just fine on your version of the table.

We're Moving!

Create My New Community Account Now

Data load performance vs query performance for Flex tables

Comments

Leave a Comment