Is it possible to update a Flex Table view?
Say I load some JSON data into a table and create a Flex Table view using COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW(). Later, I load additional JSON data into the same table. I'd like to update the Flex Table view to reflect the additional columns or changed data types introduced by the new JSON data. How would I do that?
Comments
The Flex Table data functions (p. 71 of the linked documentation) may be called repeatedly. Each call recomputes over the full, current dataset and produces a view reflecting the table's contents at the time it runs; the functions have no knowledge of having been run before.
Thanks,
James
Your comment that the function call is not aware of previous calls is interesting. Does that mean Vertica recomputes keys and data types for all the data each time the function is called, even though only part of the table's data is new? That could become very costly as the source Flex Table grows.
Is there a way to incrementally update the Flex Table view definition as new batches of "dark" data are loaded into the same table? Basically, we want to avoid recomputing keys and data types for data we have already processed.
This brings up a second question. Say I have a Flex Table and a matching view. If I load new data into the Flex Table that does not match the view definition and I don't rerun the COMPUTE_... function, what happens? Will I get errors if I try to query the view at all?
Nick
(1st paragraph) This is correct.
(2nd paragraph) In this initial Flex Table release, there is no fully automated way to do so, but we've already put thought into supporting this directly. We're very interested in hearing any use cases you have for this and/or preferences for how you think it should look.
One thing that may help here is that we allow you to manually update the keys table. A custom query could update the keys, for example by looking at which rows in your Flex Table come from a more recent epoch than the latest epoch recorded in your keys table.
(3rd paragraph) Flexible Tables assume a variable schema and will not raise errors for requesting data that does not exist. A query against the view returns values for columns that exist in a given row, and NULL for columns that row does not contain.
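A concrete sketch of this behavior (the table name public.foo, its default view name public.foo_view, and the sample rows are assumptions for illustration):

```sql
-- Load two JSON rows with different keys; only the second has "b".
COPY public.foo FROM STDIN PARSER fjsonparser();
{"a": 1}
{"a": 2, "b": "new"}
\.

-- Compute keys over both rows and build the view.
SELECT COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW('public.foo');

-- The view exposes both "a" and "b". Querying "b" for the first row
-- simply returns NULL rather than raising an error.
SELECT a, b FROM public.foo_view;
```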
James
We are considering using Flex Tables to allow our analysts to query JSON data via SQL without us having to transform it into a tabular format first, which is what we currently do. A given set of JSON data will go into a given table, and over time its schema will evolve. If we have to recompute keys and types for the whole table every time a new batch is loaded, then I assume that step will take increasingly long to complete as the table grows. It will probably also make it impossible for us to trickle-load a Flex Table and keep its keys and data types always up-to-date.
There are many options available for solving this problem incrementally, and the best choice depends heavily on your data's attributes and lifecycle. In the meantime, there is no single "best practice".
Of note is that we allow you to make any changes you want to the keys table and have those changes reflected in the view the next time you call BUILD_FLEXTABLE_VIEW(). This means we fully support computing an incremental keyset and then merging it into an existing keys table.
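As a sketch of that merge step (the keys-table column names here are inferred from the internal query quoted below, and the frequency/length/type reconciliation rules are illustrative assumptions — adjust both to your release and data):

```sql
-- Merge an incrementally computed keyset into the existing keys table.
-- "epoch > 1234" stands in for the last epoch you computed keys for.
MERGE INTO public.foo_keys tgt
USING (
    SELECT keys,
           COUNT(*)      AS frequency,
           MAX(length)   AS length,
           MAX(type_oid) AS type_oid
    FROM ( SELECT mapkeysInfo(__raw__) OVER()
           FROM public.foo
           WHERE epoch > 1234 ) new_keyinfo
    GROUP BY keys
) src
ON tgt.keys = src.keys
WHEN MATCHED THEN UPDATE SET
    frequency = tgt.frequency + src.frequency,
    length    = GREATEST(tgt.length, src.length),
    type_oid  = src.type_oid   -- naive: newest batch wins on type
WHEN NOT MATCHED THEN INSERT (keys, frequency, length, type_oid)
    VALUES (src.keys, src.frequency, src.length, src.type_oid);

-- Then rebuild the view from the merged keys:
SELECT BUILD_FLEXTABLE_VIEW('public.foo');
```

The type_oid rule is the weak point of any such merge: if a key's type genuinely changes between batches, you must decide whether the old or new guess (or a widened type) should win.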
If you'd like to see the query that COMPUTE_FLEXTABLE_KEYS() and COMPUTE_FLEXTABLE_KEYS_AND_BUILD_VIEW() use to compute the keys, check your vertica.log after running either function. Assuming you're working with a Flexible Table called "foo", the keys query will start with "DELETE FROM public.foo_keys;". The interesting part of the query is:
SELECT keys, COUNT(*) AS frequency, MAX(length) AS length, MAX(type_oid) AS type_oid
FROM ( SELECT mapkeysInfo(__raw__) OVER() FROM public.foo ) all_keyinfo
GROUP BY keys;
If you'd like to filter that to only the rows loaded since a given point, check the epoch each row was loaded on and filter by it, like so:
SELECT epoch, keys, COUNT(*) AS frequency, MAX(length) AS length, MAX(type_oid) AS type_oid
FROM ( SELECT epoch, mapkeysInfo(__raw__) OVER(PARTITION BY epoch) FROM public.foo ) all_keyinfo
WHERE epoch > {EpochWhereYouLastComputedKeys}
GROUP BY keys, epoch;
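One way to supply that epoch watermark is to record it after each keys run; a minimal sketch (the bookkeeping table last_keys_epoch and its layout are assumptions, not part of Vertica):

```sql
-- One-time setup: a small bookkeeping table per maintained Flex Table.
CREATE TABLE IF NOT EXISTS last_keys_epoch (
    tbl        VARCHAR(128),
    last_epoch INT
);

-- After each keys computation, record the newest epoch seen:
INSERT INTO last_keys_epoch
    SELECT 'public.foo', MAX(epoch) FROM public.foo;
COMMIT;

-- On the next run, read the watermark back and substitute it for
-- {EpochWhereYouLastComputedKeys} in the incremental keys query:
SELECT MAX(last_epoch) FROM last_keys_epoch WHERE tbl = 'public.foo';
```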
James
We'll have a better idea of what that functionality should look like once we've had a go at implementing our own. Until then, I imagine something that works as follows would suit our needs:
There's an UPDATE_FLEXTABLE_KEYS() function that takes as input the target table and some kind of filter. The filter is used to limit the data used to compute keys and data types. In our case, we'd use the filter to get the data most recently loaded into the table. The function would merge the newly computed keys and data types into the existing ones for that table.
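To make the proposal concrete, usage might look something like this (hypothetical — UPDATE_FLEXTABLE_KEYS() does not exist in Vertica; the signature is the suggestion above, not an API):

```sql
-- Hypothetical: merge keys/types computed only over rows matching the
-- filter into the existing keyset for public.foo.
SELECT UPDATE_FLEXTABLE_KEYS('public.foo', 'epoch > 1234');
```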
This functionality fits the following usage pattern: You have a Flex Table that grows on a regular basis. You never delete from the table. The schema of the data loaded into the table slowly evolves with time. You need a quick way to keep the Flex View up-to-date with the data in the base table as it grows.
Regarding my DML vs. DDL comment: One of the larger design challenges with Flexible Tables is reconciling the fact that, because each row in a table may have its own unique schema, load operations (DML) can cause apparent table-structure changes (DDL). For a theoretical feature such as automatic maintenance of the Flex Table VIEW, the line is blurred further, since loading data into the Flexible Table could cause new columns to appear in the VIEW.
Thanks!
James
I cannot think of a common usage pattern where a filter would be preferable to a simple flag, as you described.
James