fastest way to put together 100+ columns in a single table
Hello!
I have a sparse table with 100+ columns. Each column is calculated as the last value from one of 100+ tables (one attribute per table, 6NF; each table consists of item_id, value, actual_date), so that the target table contains the current values of 100+ attributes (item_id, value1, value2, ...). I use 6NF to track changes of each attribute separately and save space. But in the end I need to combine the current values of all attributes per item to make the analysts' life easier, which means joining 100+ tables, and that is slow. It seems that all I really need is to put the files containing the data blocks for the attributes together; I can make them all the same cardinality. But there appears to be no low-level utility in Vertica that makes it that easy.
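To illustrate the pattern being described (the original post contains no code, so all table and column names below are assumptions), the per-attribute "last value" join looks roughly like this, repeated 100+ times:

```sql
-- Hypothetical sketch of the 6NF setup: one table per attribute,
-- each holding (item_id, value, actual_date). The wide table is built
-- by joining a "latest value per item" subquery for every attribute.
SELECT i.item_id,
       a1.value AS value1,
       a2.value AS value2
       -- ... one more LEFT JOIN per attribute, 100+ in total
  FROM items i
  LEFT JOIN (SELECT item_id, value
               FROM (SELECT item_id, value,
                            ROW_NUMBER() OVER (PARTITION BY item_id
                                               ORDER BY actual_date DESC) AS rn
                       FROM attr1) t
              WHERE rn = 1) a1 USING (item_id)
  LEFT JOIN (SELECT item_id, value
               FROM (SELECT item_id, value,
                            ROW_NUMBER() OVER (PARTITION BY item_id
                                               ORDER BY actual_date DESC) AS rn
                       FROM attr2) t
              WHERE rn = 1) a2 USING (item_id);
```

With 100+ such joins, the optimizer has to materialize each subquery result, which is where the cost comes from.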
Can you advise me on the cheapest way to combine 100+ columns together?
Comments
Check out the "Flattened Table" feature!
https://my.vertica.com/docs/9.1.x/HTML/index.htm#Authoring/AnalyzingData/FlattenedTables/FlattenedTables.htm
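For a rough idea of what the feature looks like (a minimal sketch with assumed table and column names — the exact correlated-subquery form accepted in SET USING should be verified against the docs linked above): each wide column is derived from its 6NF source table and refreshed on demand, instead of being joined at query time.

```sql
-- Hypothetical flattened table: value1/value2 are derived columns.
-- attr1/attr2 are the 6NF source tables (item_id, value, actual_date).
CREATE TABLE item_flat (
    item_id INT PRIMARY KEY,
    value1  FLOAT SET USING (
        SELECT a.value FROM attr1 a
         WHERE a.item_id = item_flat.item_id
           AND a.actual_date = (SELECT MAX(actual_date)
                                  FROM attr1 a2
                                 WHERE a2.item_id = a.item_id)),
    value2  FLOAT SET USING (
        SELECT a.value FROM attr2 a
         WHERE a.item_id = item_flat.item_id
           AND a.actual_date = (SELECT MAX(actual_date)
                                  FROM attr2 a2
                                 WHERE a2.item_id = a.item_id))
);

-- Recompute the SET USING columns in one pass:
SELECT REFRESH_COLUMNS('item_flat', 'value1, value2', 'REBUILD');
```

The join cost is paid once per refresh rather than on every analyst query.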
Thanks, I'll give it a try
Hi phil2,
This could just as well go into Vertica Tips ...
This comment prepares the example data; my next one is its continuation and contains the actual query. You can copy-paste the code snippets below into VSQL to test their behaviour and see if they satisfy your needs:
If I understand you right, you use sixth normal form (6NF) - which is very much the on-change-only storage approach used for sensor data: instead of one column per type of measure, you store a measure-type indicator and a value column, and you keep a measure-type dimension table like the one below (it could be named "device", "measure_type", "signal" or similar):
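The original snippet did not survive, so here is a reconstructed version of what such a dimension table could look like (all names and values are assumptions):

```sql
-- Assumed measure-type dimension: one row per kind of measurement.
CREATE TABLE measure_type (
    measure_id   INT PRIMARY KEY,
    measure_name VARCHAR(32)
);
INSERT INTO measure_type VALUES (1, 'outside_temp');
INSERT INTO measure_type VALUES (2, 'airmass');
INSERT INTO measure_type VALUES (3, 'rpm');
COMMIT;
```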
And the sensor data table would look like this:
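Again reconstructed with assumed names and example values, since the original snippet is missing - an on-change-only fact table in the vertical (EAV) shape, keyed by vehicle, test drive, and timestamp:

```sql
-- Assumed sensor fact table: one row per changed measure.
CREATE TABLE sensor_data (
    vin        VARCHAR(17),  -- vehicle id
    drv_id     INT,          -- test drive id
    ts         TIMESTAMP,    -- measurement timestamp
    measure_id INT,          -- FK to measure_type
    value      FLOAT
);
-- outside_temp (1) changes rarely; airmass (2) and rpm (3) change often.
INSERT INTO sensor_data VALUES ('VIN001', 1, '2018-06-01 10:00:00', 1, 21.5);
INSERT INTO sensor_data VALUES ('VIN001', 1, '2018-06-01 10:00:00', 2, 410.0);
INSERT INTO sensor_data VALUES ('VIN001', 1, '2018-06-01 10:00:00', 3, 800.0);
INSERT INTO sensor_data VALUES ('VIN001', 1, '2018-06-01 10:00:05', 2, 455.0);
INSERT INTO sensor_data VALUES ('VIN001', 1, '2018-06-01 10:00:05', 3, 1450.0);
INSERT INTO sensor_data VALUES ('VIN001', 1, '2018-06-01 10:05:00', 1, 21.8);
COMMIT;
```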
This is the data preparation phase for this post.
My next comment will contain how to query this.
Bear with me ....
Marco
Hi again, Phil2.
In the data prepared in my previous post above, note that, for each test drive (drv_id), we only have 3 measures for outside temperature, as it changes only sporadically, while we have more frequent changes for airmass and rpm.
The quickest way to get the measures side by side, as a colleague of mine, Maurizio, and I found out, is not Vertica's event series join, but the more "classical" horizontal pivoting approach, where we collapse the vertical structure, in our case by vehicle id, test drive id and timestamp. We will not like it at first, as the resulting table comes out sparse / ragged:
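The original query snippet is missing; a reconstruction of that horizontal pivot, using the assumed names from the preparation step, would be:

```sql
-- Classical pivot: collapse to one row per (vin, drv_id, ts).
-- Measures that did not change at a given ts come out as NULL.
SELECT vin, drv_id, ts,
       MAX(CASE WHEN measure_id = 1 THEN value END) AS outside_temp,
       MAX(CASE WHEN measure_id = 2 THEN value END) AS airmass,
       MAX(CASE WHEN measure_id = 3 THEN value END) AS rpm
  FROM sensor_data
 GROUP BY vin, drv_id, ts
 ORDER BY vin, drv_id, ts;
```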
And the result is:
As the storage is on-change-only, it's correct to assume that each NULL value should be replaced by the last known value for the measure in question, per vehicle id and test drive id. To reflect that, we can use the same OLAP window, which we name "w", for all three measures, and apply a LAST_VALUE(... IGNORE NULLS) OVER w to each of them. So we add that bit to the query (that's why I formulated the main query above as a Common Table Expression in the first place):
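Reconstructed with the same assumed names (Vertica supports both a named WINDOW clause and the IGNORE NULLS modifier on LAST_VALUE):

```sql
-- CTE holds the sparse pivot; the outer query fills the NULLs
-- with the last known value per vehicle and test drive.
WITH pivoted AS (
    SELECT vin, drv_id, ts,
           MAX(CASE WHEN measure_id = 1 THEN value END) AS outside_temp,
           MAX(CASE WHEN measure_id = 2 THEN value END) AS airmass,
           MAX(CASE WHEN measure_id = 3 THEN value END) AS rpm
      FROM sensor_data
     GROUP BY vin, drv_id, ts
)
SELECT vin, drv_id, ts,
       LAST_VALUE(outside_temp IGNORE NULLS) OVER w AS outside_temp,
       LAST_VALUE(airmass      IGNORE NULLS) OVER w AS airmass,
       LAST_VALUE(rpm          IGNORE NULLS) OVER w AS rpm
  FROM pivoted
WINDOW w AS (PARTITION BY vin, drv_id ORDER BY ts)
 ORDER BY vin, drv_id, ts;
```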
And the result this time is:
If this covers what you need, it does not matter if you have 3 or 100 different measure types ...
Happy Playing ...
Marco
I pasted the wrong output of the last query. The right one is:
Marco
Thanks for your advice. That's too dramatic a change for me. There are too many places referencing those tables; the whole DWH is built on them, so it is not an option. What's more, those tables range from 100M+ to 1B+ rows, so I'm just not sure it is a good idea to store all of them in one EAV super-table.
In the end I just group the attributes by change frequency, join the single attributes within each group in parallel, and then join the groups together. That looks like the best I can get out of it. The slowness of the joins seems to come from early materialization of the grouped tables when they become inner tables in the final join, and from all the NULLs in the sparse columns. Grouping by frequency saves the day.
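A sketch of that workaround, with hypothetical names (latest_attrN standing for the per-attribute "latest value" derivations): attributes of similar change frequency are pre-joined into temp tables, and only the group results are joined at the end.

```sql
-- Group 1: frequently changing attributes.
CREATE LOCAL TEMP TABLE grp_fast ON COMMIT PRESERVE ROWS AS
SELECT a1.item_id, a1.value AS value1, a2.value AS value2
  FROM latest_attr1 a1
  JOIN latest_attr2 a2 USING (item_id);

-- Group 2: rarely changing attributes.
CREATE LOCAL TEMP TABLE grp_slow ON COMMIT PRESERVE ROWS AS
SELECT a3.item_id, a3.value AS value3, a4.value AS value4
  FROM latest_attr3 a3
  JOIN latest_attr4 a4 USING (item_id);

-- Final join combines a handful of groups instead of 100+ tables.
SELECT *
  FROM grp_fast
  JOIN grp_slow USING (item_id);
```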
It would have been much easier if there were a function to put together same-cardinality tables as-is, or a join optimization for tables sharing exactly the same join key column, so that no materialization would be needed during the join at all. I believe such an operation would be extremely fast.