Managing SCD2 tables on Vertica

PoslanikPoslanik Vertica Customer

Hi all,
On Vertica Big Data Conference 2020 there was a discussion in the session The Shortest Path to Vertica – Best Practices for Data Warehouse Migration and ETL regarding managing SCD2 dimensional tables in Vertica.
I am looking forward for the mentioned blog post and more info on optimize MERGE and swap partitions.
Thank you

Best Answers

Answers

  • PoslanikPoslanik Vertica Customer

    Hi @marcothesane ,
    Thank You for sharing blog post.
    For using SWAP_PARTITIONS_BETWEEN_TABLES()which column would SCD2 target dimension table be partitioned by?
    Thank You

  • marcothesanemarcothesane - Select Field - Administrator

    Oh and I forgot:

    You should try to gain, for each SCD, some experience, of how far into the past the effective dates of the incoming, daily or so, delta , go for each incremental load. Base your decision on that....

    Cheers -

  • PoslanikPoslanik Vertica Customer
    edited April 2020

    Hi @marcothesane ,
    Thank you very much for the explanation. I understand the idea about SWAP_PARTITIONS_BETWEEN_TABLES().

    Your blog helped me a lot in writing a query for handling SCD load. I did some additional things, like introducing columns which have MD5() calculation of all SCD1 and SCD2 columns. That way it is easier to check if there were changes in SCD2 columns in comparison to previous record and if there were no changes in SCD2 columns a record will be filtered out.
    Also, metadata table was introduced and query is dynamically built in Java because Talend is used as an ETL tool.
    I can share a query with you if you are interested.

    Just one more thing that popped my mind: What do we need to pay attention on because of /*+DIRECT*/ hint in merge statement? I read to many ROS container errors could appear.
    Thank you!

  • PoslanikPoslanik Vertica Customer

    Hi @marcothesane ,
    Thank you for your feedback. Now I understand the idea of SWAP_PARTITIONS_BETWEEN_TABLES() in SCD table load.

    Your blog helped me a lot in building a query for SCD table load. I did some additions like introducing two new columns. Each column has MD5()calculation of concatenation of all SCD1 and SCD2 column values. That way it is easier to check if there were changes in new records on SCD1 and/or SCD2 columns and if needed filter out records which didn't change.
    I can share a query with you if you are interested.

    Just one more thing that popped on my mind: Do we need to pay attention on Vertica DB because of /*+DIRECT*/ hint in merge statement? I read ERROR: Too many ROS containers could appear.

    Thank you!

  • marcothesanemarcothesane - Select Field - Administrator

    On the /*+DIRECT */ hint:
    Starting with Vertica 9.3, it's superfluous anyway. We are beginning to do without the WOS altogether.

    • The treatment of ROS containers has improved dramatically.
    • If you do the MERGE in one transaction, for example, you only get one ROS container - or one ROS container per table partition - per node anyway.

    The basic approach in Vertica, is always: Very few, very big DML transactions. We're a Big Data RDBMS platform, and optimised for it . Something good should come out of that fact ...

  • marcothesanemarcothesane - Select Field - Administrator

    I actually have my own way of keeping metadata to encourage an automatic build of these SCD maintenance jobs:
    I start from the staging table, which should have the format:

    • natural identifier -- part of (disabled) unique constraint / primary key
    • change date/timestamp -- part of (disabled) unique constraint / primary key
    • all Type I and all Type II columns.
      From v_catalog.columns, I select all columns of that staging table, create a derived table and add a column_role column, containing one of: ARRAY ['natural_id','change_ts', 'tp1','tp2'] , as the case may be.
      Then, I can generate the CREATE TABLE of the actual SCD table, by adding a %_key column as the first column, role surr_key, changing the role of the staging change_ts (or change_dt as the case may be) to from _ts or from_dt respectively. The second column becomes the natural_id, the third becomes e.g from_dt, the fourth is added as e.g. to_dt, the fifth is added as is_current Boolean, and the rest of the columns are the columns from the staging table, starting from the 3rd staging column.
      Once that's done, all scripts I use in the Blog can be generated, using SQL generating SQL.
      It works, but in the shape it is, it is probably only obvious to one with as a distorted mind as myself, so I never got round publishing it ...
      Happy playing ...

    Marco

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file