Managing SCD2 tables on Vertica
Hi all,
At the Vertica Big Data Conference 2020, there was a discussion in the session "The Shortest Path to Vertica – Best Practices for Data Warehouse Migration and ETL" about managing SCD2 dimension tables in Vertica.
I am looking forward to the mentioned blog post and to more information on optimized MERGE and swapping partitions.
Thank you
Best Answers
marcothesane - Administrator
Hi @Poslanik -
My blog involving optimized MERGE is here:
https://www.linkedin.com/pulse/said-veterans-tricky-maintain-even-vertica-batches-doubly-gessner-1f/
I'm still struggling to find the right point in time to add the same scenario using SWAP_PARTITIONS_BETWEEN_TABLES(). But, on the other hand, Maurizio's part in the presentation today should already convey how it should be done ...
Hope this helps for starters ...
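In the meantime, a minimal sketch of what the swap call itself looks like (table names and the yearmonth range are made up; both tables must have identical definitions):

    -- move the partitions for 2020-01 .. 2020-03 from the work table into the dimension
    SELECT SWAP_PARTITIONS_BETWEEN_TABLES('dim_customer_work', 202001, 202003, 'dim_customer');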
marcothesane - Administrator
Hi @Poslanik ,
I've been thinking about this one for quite a while. I'm not actually sure if there's a one-size-fits-all here (I'm beginning to like that line attributed to Mike Stonebraker a lot):
If the "Slowly" in SCD stands for once every few days on average, I'd be tempted to partition by the yearmonth of from_dt (that's YEAR(from_dt)*100 + MONTH(from_dt)), to swap the partitions of the last 3 months, and to handle the very few rows that go further into the past with the MERGE approach.
If "Slowly" stands for once every few years, however, I'd opt for the yearmonth of chg_dt, and also work on the last 3 partitions.
As I'm mostly involved in rather new Vertica implementations, I have not been exposed enough to Vertica data warehouses that are several years old, and it's really the experience in those realms that gives you a confident attitude about what's best to do here.
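For illustration, a minimal sketch of the first option (table and column names are made up); the last 3 yearmonth partitions would be rebuilt in a work table of identical definition and swapped in with SWAP_PARTITIONS_BETWEEN_TABLES():

    -- hypothetical SCD2 dimension, partitioned by yearmonth of from_dt
    CREATE TABLE dim_customer (
        customer_key   INT NOT NULL,   -- surrogate key, e.g. from a sequence
        customer_id    INT NOT NULL,   -- natural key
        from_dt        DATE NOT NULL,
        to_dt          DATE,
        is_current     BOOLEAN,
        customer_name  VARCHAR(128)
    )
    PARTITION BY YEAR(from_dt) * 100 + MONTH(from_dt);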
I'm pretty confident, though, that one of the two above is almost always a good start.
Hope this helps --
Good luck!
Answers
Hi @marcothesane ,
Thank you for sharing the blog post.
When using SWAP_PARTITIONS_BETWEEN_TABLES(), which column should the SCD2 target dimension table be partitioned by?
Thank you
Oh and I forgot:
You should try to gain, for each SCD, some experience of how far into the past the effective dates of the incoming (daily or so) delta go for each incremental load. Base your decision on that.
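A quick way to measure that, as a sketch (staging table and column names are assumptions):

    -- how far back do the effective dates of the current delta reach?
    SELECT
        MIN(change_dt)                                  AS oldest_effective_date,
        DATEDIFF('month', MIN(change_dt), CURRENT_DATE) AS months_back
    FROM stg_customer;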
Cheers -
Hi @marcothesane ,
Thank you for your feedback. Now I understand the idea of SWAP_PARTITIONS_BETWEEN_TABLES() in the SCD table load.
Your blog helped me a lot in building a query for the SCD table load. I made some additions, like introducing two new columns: one holds the MD5() calculation of the concatenation of all SCD1 column values, the other of all SCD2 column values. That way it is easier to check whether new records changed in the SCD1 and/or SCD2 columns and, if needed, to filter out records which didn't change. Also, a metadata table was introduced, and the query is built dynamically in Java, because Talend is used as the ETL tool.
I can share the query with you if you are interested.
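For illustration, a minimal sketch of such a hash calculation (column names are made up; the delimiter and NULL handling just need to be applied consistently):

    -- one hash over the SCD1 columns, one over the SCD2 columns
    SELECT
        customer_id,
        change_ts,
        MD5(COALESCE(phone_number, ''))                                   AS scd1_hash,
        MD5(COALESCE(customer_name, '') || '|' || COALESCE(address, '')) AS scd2_hash
    FROM stg_customer;
    -- incoming rows whose scd2_hash matches the current dimension row's hash get filtered out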
Just one more thing that popped into my mind: do we need to pay attention to anything on the Vertica DB because of the /*+DIRECT*/ hint in the MERGE statement? I read that ERROR: Too many ROS containers could appear.
Thank you!
On the /*+DIRECT */ hint:
Starting with Vertica 9.3, it's superfluous anyway. We are beginning to do without the WOS altogether.
The basic approach in Vertica is always: very few, very big DML transactions. We're a Big Data RDBMS platform, and optimised for it. Something good should come out of that fact ...
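Should you still want to keep an eye on container counts, a minimal monitoring sketch (the LIMIT is arbitrary):

    -- projections with the most ROS containers
    SELECT node_name, projection_name, COUNT(*) AS ros_count
    FROM v_monitor.storage_containers
    GROUP BY node_name, projection_name
    ORDER BY ros_count DESC
    LIMIT 20;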
I actually have my own way of keeping metadata to enable an automatic build of these SCD maintenance jobs:
I start from the staging table, which should have this format: the natural key as the first column, the change timestamp as the second, and the payload columns from the third column onward.
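A minimal example of such a staging table (column names are made up):

    CREATE TABLE stg_customer (
        customer_id    INT NOT NULL,        -- role: natural_id
        change_ts      TIMESTAMP NOT NULL,  -- role: change_ts
        customer_name  VARCHAR(128),        -- role: tp2 (SCD2 attribute)
        phone_number   VARCHAR(32)          -- role: tp1 (SCD1 attribute)
    );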
From v_catalog.columns, I select all columns of that staging table, create a derived table, and add a column_role column containing one of ARRAY['natural_id', 'change_ts', 'tp1', 'tp2'], as the case may be.
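As a sketch, that derived table could be built like this (schema and table names are assumptions; the roles are filled in afterwards):

    -- metadata table describing the staging columns and their SCD roles
    CREATE TABLE meta_scd_columns AS
    SELECT
        table_name,
        column_name,
        ordinal_position,
        data_type,
        CAST(NULL AS VARCHAR(16)) AS column_role  -- later set to natural_id / change_ts / tp1 / tp2
    FROM v_catalog.columns
    WHERE table_schema = 'stage'
      AND table_name   = 'stg_customer';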
Then, I can generate the CREATE TABLE of the actual SCD table: I add a %_key column as the first column, with role surr_key, and change the role of the staging change_ts (or change_dt, as the case may be) to from_ts or from_dt respectively. The second column becomes the natural_id, the third becomes e.g. from_dt, the fourth is added as e.g. to_dt, the fifth is added as is_current BOOLEAN, and the rest of the columns are the columns from the staging table, starting from the 3rd staging column.
Once that's done, all the scripts I use in the blog can be generated, using SQL generating SQL.
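To give a flavour of the SQL-generating-SQL step, a sketch (assumed names again) that emits the payload part of the target table's DDL; the fixed columns (surrogate key, natural_id, from_dt, to_dt, is_current) would be prepended by the generator:

    -- one DDL fragment per payload column, in staging order
    SELECT '    ' || column_name || ' ' || data_type || ',' AS ddl_fragment
    FROM meta_scd_columns
    WHERE column_role IN ('tp1', 'tp2')
    ORDER BY ordinal_position;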
It works, but in the shape it's in, it is probably only obvious to someone with as distorted a mind as my own, so I never got round to publishing it ...
Happy playing ...
Marco