Slow Performance on Flattened Facts Table
Hi, hope someone had a similar issue. I have facts table with about 400 mln records and 20 columns. Client requested to provide information from facts table and additional attributes from 4 Dim tables.
DimA - 21 attributes
DimB - 5 atributes
DimC - 4 attributes
DimD - 2 attributes.
My first instinct was to create a view with all the joins to Dim tables to bring relevant data into one structure. I noticed that as I was adding joins from facts table to dimensions, the performance of the view was gradually degrading (I started adding the lowest DimD to the highest DimA).
The performance on the final view, with all 4 Dim added, was unusable for the client to work.
My second approach was to build a flat table with all facts and all relevant attributes added as columns via SET USING function. A/C to Vertica documentation, this is recommended setup on very large table; however, REFRESH_COLUMNS takes hours (>3 hrs now) to populate! This is again unacceptable.
I plan to change the column setting on attributes from DimB-C to use DEFAULT instead of SET USING, but DimA is very volatile (meaning attributes in that table get updated daily and are very likely to change). Client requests to have these attributes up_to date on entire Facts data.
Any recommendations are appreciated. How to make either views or flattened tabel work faster and more efficient.
Answers
Hi,
The Flattened Table feature was designed based on a use case of small dimension tables. FT always puts the dimension table as the inner input to the join to make sure data integrity enforced. They have to fit in memory. Are you using large dimension tables?
Note that there is an active project to improve FT performance on large dimension tables. I'll keep this post updated on the progress. Right now it looks like a new feature that we'll see this summer!
Jim, thanks for the response. Facts table is about 350M rows; while dimensions are:
DimA - 8.5M
DimB - 30K
DimC - 10.5M
DimD - 100 rows.
I eliminated column lookups on DimB and DImD, instead use join when loading data to the flattened table. For DimA and C, I created a procedure to identify the changes and then run REFRESH_COLUMNS on that column only. Seems to be a bit faster than using REFRESH_COLUMNS on entire table.