Using Slowly Changing Dimension

Since client source data is not provided with correct end time period for SCD, I am trying to use start date as time interval and convert the scd table to view price listed for each date on separate line. Please check below scd table (note that product_id and start_date are PK):


The output of the view would be following:


Appreciate any feedback.

Tagged:

Answers

  • LenoyJLenoyJ - Select Field - Employee
    edited October 2019

    Looks like a use-case for Time Series Analytics?

    dbadmin=> select * from anu order by prdid,start_date;
     prdid  | start_date | price
    --------+------------+-------
     123456 | 2019-03-01 |   299
     123456 | 2019-07-29 |   299
     123456 | 2019-08-05 |   299
     123456 | 2019-08-12 |   269
     456789 | 2019-04-01 |   123
     456789 | 2019-08-29 |   123
     456789 | 2019-09-05 |   456
    (7 rows)
    

    .

    dbadmin=> SELECT prdid, slice_time::date as start_date, TS_FIRST_VALUE(price, 'CONST') price FROM anu TIMESERIES slice_time AS '1 day' OVER(PARTITION by prdid ORDER BY start_date::timestamp);
     prdid  | start_date | price
    --------+------------+-------
     123456 | 2019-03-01 |   299
     123456 | 2019-03-02 |   299
     123456 | 2019-03-03 |   299
     123456 | 2019-03-04 |   299
     123456 | 2019-03-05 |   299
     .
     .
     .
     123456 | 2019-07-27 |   299
     123456 | 2019-07-28 |   299
     123456 | 2019-07-29 |   299
     123456 | 2019-07-30 |   299
     .
     .
     .
     123456 | 2019-08-03 |   299
     123456 | 2019-08-04 |   299
     123456 | 2019-08-05 |   299
     123456 | 2019-08-06 |   299
     .
     .
     .
     123456 | 2019-08-11 |   299
     123456 | 2019-08-12 |   269
     456789 | 2019-04-01 |   123
     456789 | 2019-04-02 |   123
     .
     .
     .
     456789 | 2019-08-28 |   123
     456789 | 2019-08-29 |   123
     456789 | 2019-08-30 |   123
     .
     .
     .
     456789 | 2019-09-03 |   123
     456789 | 2019-09-04 |   123
     456789 | 2019-09-05 |   456
    (323 rows)
    
  • LenoyJLenoyJ - Select Field - Employee
    edited October 2019

    Also, if you're looking to fix end_date. See if using the lead analytical function would help?

    dbadmin=> select prdid,start_date, lead(start_date,1,GETDATE()+1) OVER (PARTITION BY prdid ORDER BY prdid,start_date)-1 as end_date, price from anu;
     prdid  | start_date |  end_date  | price
    --------+------------+------------+-------
     123456 | 2019-03-01 | 2019-07-28 |   299
     123456 | 2019-07-29 | 2019-08-04 |   299
     123456 | 2019-08-05 | 2019-08-11 |   299
     123456 | 2019-08-12 | 2019-10-11 |   269
     456789 | 2019-04-01 | 2019-08-28 |   123
     456789 | 2019-08-29 | 2019-09-04 |   123
     456789 | 2019-09-05 | 2019-10-11 |   456
    (7 rows)
    
  • marcothesanemarcothesane - Select Field - Administrator

    This depends on whether you want your product table to be an SCD table or not.
    If you have something like a fact table near it, I'd model the table like so:
    CREATE TABLE d_product (
    prdkey INT NOT NULL PRIMARY KEY DEFAULT HASH(prdid,start_date)
    , prdid INT NOT NULL
    , start_date DATE NOT NULL
    , end_date DATE NOT NULL
    , price NUMERIC(18,2)
    );

    And I'd fill it like so (sorry for the lack of monospace font, but I could simply not get it to display....):

    WITH
    your_input(prdid,start_date,price) AS (
    SELECT           123456,DATE '2019-03-01',299
    UNION ALL SELECT 123456,DATE '2019-07-29',299
    UNION ALL SELECT 123456,DATE '2019-08-05',299
    UNION ALL SELECT 123456,DATE '2019-08-12',269
    UNION ALL SELECT 456789,DATE '2019-04-01',123
    UNION ALL SELECT 456789,DATE '2019-08-29',123
    UNION ALL SELECT 456789,DATE '2019-09-05',456
    )
    ,
    in_w_prev_price AS (
    SELECT
      prdid
    , start_date
    , LAG (price)                     OVER(w) AS prev_price
    , price
    FROM your_input
    WINDOW w AS(PARTITION BY prdid ORDER BY start_date)
    )
    SELECT
      prdid
    , start_date
    , LEAD(start_date,1,'9999-12-31') OVER(w) AS end_date
    , price
    FROM in_w_prev_price
    WHERE prev_price IS NULL
       OR prev_price <> price
    WINDOW w AS(PARTITION BY prdid ORDER BY start_date)
    ;
    -- output of SELECT ...                                           
    -- out  prdid  | start_date |  end_date  | price 
    -- out --------+------------+------------+-------
    -- out  123456 | 2019-03-01 | 2019-08-12 |   299
    -- out  123456 | 2019-08-12 | 9999-12-31 |   269
    -- out  456789 | 2019-04-01 | 2019-09-05 |   123
    -- out  456789 | 2019-09-05 | 9999-12-31 |   456
    -- out (4 rows)
    

    If ever you can, remove redundancies.
    And use the one-photograph-per-day approach using the TIMESERIES clause, as suggested by @LenoyJ - of which I'm deeply fond, too, at reporting time only;

    Don't create views that produce much more rows than will ever be used in a report, and, by all means, don't produce tables that risk ending up containing millions of redundant rows ...

  • Great suggestions! Thank you very much. I have already created a view with lead statement to correct end date and joined product table to date table based on start / end date interval. Just curious, why would you state not to use views?
    In the example I provided, i gave the final pricing table; however, client provides us partial updates. Sometimes, these are not updates at all, as you can see, the start date might change, but price remains the same. In other cases, price changes; while start/end dates remain the same. I load all new data to a temp table (src) first and then use merge to product dim (tgt) to identify updates and inserts. I limit updates only to records witch changed end date or price. Inserts are defined by a new start date. After reloading all pricing files from client, i noticed that end date is incorrectly managed on client end and I need to fix it on my end. The best option I saw was to load the data as is (using merge) and then create a view on the product dim to correct end dates. Would you recommend a different approach?

  • marcothesanemarcothesane - Select Field - Administrator

    I don't mean to not use views in general. But a view that generates one row per day, even if a "product" changes, say, once a year, would be 364 rows too many.
    I would use:
    SELECT * FROM temp_table
    UNION ALL
    SELECT * FROM target_product_dim
    WHERE start_date >= (SELECT MIN(start_date) FROM temp_table);
    And then use the output of this select for the query I wrote above, removing all rows with equal prices, and replace the rows from the target table with the same time range with the output of my query above. Storing data on-change-only, and with no gaps in the timeline.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file