Effective way to bulk update with Merge
Hi,
I have a table containing sales data for last three years (hundreds of millions of rows). Source data gets minor changes every day in depth of last one-two months (error corrections) and I need to keep my vertica table in actual state. The idea is to update data for last two months every morning, using Merge, to keep data correct.
Maybe I'm mistaking about how Merge does its work, but as I understand it will delete ALL rows of period, despite of were changes occured or not, and perform full insert after. It doesn't look healthy strategy to rewrite (delete/insert) millions of the same rows every day just to actualize data in two of three rows.
In SQL Server i used "update only if changed hash on row" strategy, which includes adding hash column in the table, and merging (updating) on condition of inequality of hashes.
Is there any sense to do merge-on-inequal hash in Vertica? Adding such condition turns off optimized merge. Or it would be unnecesary overcomplicating of things? What is a common practice in such kind of updates in Vertica?
Best Answer
-
marcothesane - Select Field - Administrator
Hi Alex -
If your table looks like this:
CREATE TABLE merge_target ( prod_ky INT NOT NULL PRIMARY KEY , prod_price NUMERIC(18,2) );
, the staging table
merge_source
has the same structure, and your MERGE statement looks like this:MERGE /* +DIRECT */ INTO merge_target t USING merge_source s ON s.prod_ky=t.prod_ky WHEN MATCHED THEN UPDATE SET prod_ky=s.prod_ky -- this hurts - the identifier in the set clause- but this is what makes the difference ... , prod_price=s.prod_price WHEN NOT MATCHED THEN INSERT ( prod_ky , prod_price ) VALUES ( s.prod_ky , s.prod_price ) ;
... then you will have an optimized Merge, (a DELETE access plan and an INSERT access plan), but only on the rows existing in
merge_source
, not all in the target.Pre-requisite is that the target table has a primary key or a unique index - and that that one is used in the ON clause of the MERGE statement. And note that all columns of the table are used in both the Update and the Insert branch of the statement.
Consider partitioning the target table by yearmonth, creating a third table with the exact same structure, copying the partitions of the last two months into that table, applying the optimized MERGE to that table, and finally swapping partitions between that table and the final target table.
Check out the Vertica docu on SWAP_PARTITIONS_BETWEEN_TABLES() to see how that could work ...
Good luck -
Marco1
Answers
Thank you, Marco
The swap partitions technics seems like exactly what i need in this case. Never thought about this way. Cool idea