Managing SCD2 tables on Vertica
Hi all,
At the Vertica Big Data Conference 2020, there was a discussion in the session "The Shortest Path to Vertica – Best Practices for Data Warehouse Migration and ETL" about managing SCD2 dimension tables in Vertica.
I am looking forward to the mentioned blog post and to more information on optimized MERGE and swapping partitions.
Thank you
Best Answers
marcothesane - Administrator
Hi @Poslanik -
My blog involving optimized MERGE is here:
https://www.linkedin.com/pulse/said-veterans-tricky-maintain-even-vertica-batches-doubly-gessner-1f/
I'm still struggling to find the right point in time to add the same scenario using SWAP_PARTITIONS_BETWEEN_TABLES(). But, on the other hand, Maurizio's part in the presentation today should already convey how it should be done ...
Hope this helps for starters ...
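In the meantime, a minimal sketch of what the swap call itself looks like (table names and the yearmonth range are made up; both tables must have identical definitions):

    -- move the partitions for 2020-01 .. 2020-03 from the work table into the dimension
    SELECT SWAP_PARTITIONS_BETWEEN_TABLES('dim_customer_work', 202001, 202003, 'dim_customer');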
marcothesane - Administrator
Hi @Poslanik ,
I've been thinking about this one for quite a while. I'm not actually sure if there's a one-size-fits-all here (I'm beginning to like that line attributed to Mike Stonebraker a lot):
If the "Slowly" in SCD stands for once every few days on average, I'd be tempted to partition by the yearmonth of from_dt (that's YEAR(from_dt)*100 + MONTH(from_dt)), to swap the partitions of the last 3 months, and to handle the very few rows that go further into the past with the MERGE approach.
If "Slowly" stands for once every few years, however, I'd opt for the yearmonth of chg_dt, and also work on the last 3 partitions.
As I'm mostly involved in rather new Vertica implementations, I have not been exposed enough to Vertica data warehouses that are several years old, and it's really the experience in those realms that gives you a confident attitude about what's best to do here.
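For illustration, a minimal sketch of the first option (table and column names are made up); the last 3 yearmonth partitions would be rebuilt in a work table of identical definition and swapped in with SWAP_PARTITIONS_BETWEEN_TABLES():

    -- hypothetical SCD2 dimension, partitioned by yearmonth of from_dt
    CREATE TABLE dim_customer (
        customer_key   INT NOT NULL,   -- surrogate key, e.g. from a sequence
        customer_id    INT NOT NULL,   -- natural key
        from_dt        DATE NOT NULL,
        to_dt          DATE,
        is_current     BOOLEAN,
        customer_name  VARCHAR(128)
    )
    PARTITION BY YEAR(from_dt) * 100 + MONTH(from_dt);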
I'm pretty confident, though, that one of the two above is almost always a good start.
Hope this helps --
Good luck!
Answers
Hi @marcothesane ,
Thank you for sharing the blog post.
When using SWAP_PARTITIONS_BETWEEN_TABLES(), which column should the SCD2 target dimension table be partitioned by?
Thank you
Oh and I forgot:
You should try to gain, for each SCD, some experience of how far into the past the effective dates of the incoming (daily or so) delta go for each incremental load. Base your decision on that.
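A quick way to measure that, as a sketch (staging table and column names are assumptions):

    -- how far back do the effective dates of the current delta reach?
    SELECT
        MIN(change_dt)                                  AS oldest_effective_date,
        DATEDIFF('month', MIN(change_dt), CURRENT_DATE) AS months_back
    FROM stg_customer;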
Cheers -
Hi @marcothesane ,
Thank you for your feedback. Now I understand the idea of SWAP_PARTITIONS_BETWEEN_TABLES() in the SCD table load.
Your blog helped me a lot in building a query for the SCD table load. I made some additions, like introducing two new columns: one holds the MD5() calculation of the concatenation of all SCD1 column values, the other of all SCD2 column values. That way it is easier to check whether new records changed in the SCD1 and/or SCD2 columns and, if needed, to filter out records which didn't change. Also, a metadata table was introduced, and the query is built dynamically in Java, because Talend is used as the ETL tool.
I can share the query with you if you are interested.
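For illustration, a minimal sketch of such a hash calculation (column names are made up; the delimiter and NULL handling just need to be applied consistently):

    -- one hash over the SCD1 columns, one over the SCD2 columns
    SELECT
        customer_id,
        change_ts,
        MD5(COALESCE(phone_number, ''))                                   AS scd1_hash,
        MD5(COALESCE(customer_name, '') || '|' || COALESCE(address, '')) AS scd2_hash
    FROM stg_customer;
    -- incoming rows whose scd2_hash matches the current dimension row's hash get filtered out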
Just one more thing that popped into my mind: do we need to pay attention to anything on the Vertica DB because of the /*+DIRECT*/ hint in the MERGE statement? I read that ERROR: Too many ROS containers could appear.
Thank you!
On the /*+DIRECT */ hint:
Starting with Vertica 9.3, it's superfluous anyway. We are beginning to do without the WOS altogether.
The basic approach in Vertica is always: very few, very big DML transactions. We're a Big Data RDBMS platform, and optimised for it. Something good should come out of that fact ...
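Should you still want to keep an eye on container counts, a minimal monitoring sketch (the LIMIT is arbitrary):

    -- projections with the most ROS containers
    SELECT node_name, projection_name, COUNT(*) AS ros_count
    FROM v_monitor.storage_containers
    GROUP BY node_name, projection_name
    ORDER BY ros_count DESC
    LIMIT 20;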
I actually have my own way of keeping metadata to enable an automatic build of these SCD maintenance jobs:
I start from the staging table, which should have this format: the natural key as the first column, the change timestamp as the second, and the payload columns from the third column onward.
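A minimal example of such a staging table (column names are made up):

    CREATE TABLE stg_customer (
        customer_id    INT NOT NULL,        -- role: natural_id
        change_ts      TIMESTAMP NOT NULL,  -- role: change_ts
        customer_name  VARCHAR(128),        -- role: tp2 (SCD2 attribute)
        phone_number   VARCHAR(32)          -- role: tp1 (SCD1 attribute)
    );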
From v_catalog.columns, I select all columns of that staging table, create a derived table, and add a column_role column containing one of ARRAY['natural_id', 'change_ts', 'tp1', 'tp2'], as the case may be.
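As a sketch, that derived table could be built like this (schema and table names are assumptions; the roles are filled in afterwards):

    -- metadata table describing the staging columns and their SCD roles
    CREATE TABLE meta_scd_columns AS
    SELECT
        table_name,
        column_name,
        ordinal_position,
        data_type,
        CAST(NULL AS VARCHAR(16)) AS column_role  -- later set to natural_id / change_ts / tp1 / tp2
    FROM v_catalog.columns
    WHERE table_schema = 'stage'
      AND table_name   = 'stg_customer';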
Then, I can generate the CREATE TABLE of the actual SCD table: I add a %_key column as the first column, with role surr_key, and change the role of the staging change_ts (or change_dt, as the case may be) to from_ts or from_dt respectively. The second column becomes the natural_id, the third becomes e.g. from_dt, the fourth is added as e.g. to_dt, the fifth is added as is_current BOOLEAN, and the rest of the columns are the columns from the staging table, starting from the 3rd staging column.
Once that's done, all the scripts I use in the blog can be generated, using SQL generating SQL.
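To give a flavour of the SQL-generating-SQL step, a sketch (assumed names again) that emits the payload part of the target table's DDL; the fixed columns (surrogate key, natural_id, from_dt, to_dt, is_current) would be prepended by the generator:

    -- one DDL fragment per payload column, in staging order
    SELECT '    ' || column_name || ' ' || data_type || ',' AS ddl_fragment
    FROM meta_scd_columns
    WHERE column_role IN ('tp1', 'tp2')
    ORDER BY ordinal_position;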
It works, but in the shape it's in, it is probably only obvious to someone with as distorted a mind as my own, so I never got round to publishing it ...
Happy playing ...
Marco