
Updating a multi-billion-row table with 200+ fields

Hi,

I am trying to update a multi-billion-row table, touching 200+ fields.

We tried single-column UPDATEs as well as CTAS-ing to a new table with the updated fields computed via CASE expressions. Neither approach worked: the jobs ran for more than a day and we killed them.

The issue we have is that the table contains fields holding the character string 'NULL' instead of actual NULLs.
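
For illustration, the shape of one of the single-column updates we tried (the table and column names here are just examples):

    UPDATE big_table
    SET col_001 = NULL
    WHERE col_001 = 'NULL';
    COMMIT;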

If anyone could suggest a better approach, that would be wonderful.

Comments

  • UPDATE is rather costly in Vertica, as it performs a DELETE and then an INSERT. Like you mentioned, you could CTAS, but you should write the CASE expressions into the SELECT, and partition the operation into smaller pieces to avoid a single-pass UPDATE. If you anticipate you'll be performing UPDATE statements in the future, you should try to optimize your projections for this operation.
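
    A minimal sketch of that CTAS shape, using hypothetical table and column names (NULLIF(col, 'NULL') is just shorthand for the CASE expression):

      CREATE TABLE big_table_clean AS
      SELECT
          key_col,
          NULLIF(col_001, 'NULL') AS col_001,
          NULLIF(col_002, 'NULL') AS col_002
          -- ...repeat for the remaining 200+ columns...
      FROM big_table
      WHERE month_key = 202301;   -- process one chunk/partition per pass
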
  • Hi,

    Please provide some info about your cluster setup. Also, what percentage of the data needs to be updated, and do you have any partitioning set up on the table?

  • If the number of records that need to be updated is relatively small compared to the total, I would advise the method below:

     

    · Insert into a temp table all the records that need to be updated, with all the CASE expressions you mentioned, so the temp table contains the already-updated rows.
    · Delete from the source table based on the keys that exist in the temp table.
    · Insert from the temp table back into the source table.

     

    You can run this method in parallel, with each thread working on a different population. As I see it, this will minimize the number of delete vectors that Vertica creates and will optimize the replay delete done by the Tuple Mover. Please make sure you run PURGE_TABLE after each iteration, otherwise you will see performance degradation on the following iterations.
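
    A rough sketch of one such iteration, assuming a key column named key_col and placeholder table/column names:

      CREATE LOCAL TEMPORARY TABLE upd_batch ON COMMIT PRESERVE ROWS AS
      SELECT key_col,
             NULLIF(col_001, 'NULL') AS col_001
             -- ...plus every other column of big_table, cleaned with CASE/NULLIF where needed...
      FROM big_table
      WHERE key_col BETWEEN 1 AND 10000000;   -- this thread's population

      DELETE FROM big_table
      WHERE key_col IN (SELECT key_col FROM upd_batch);

      INSERT INTO big_table SELECT * FROM upd_batch;
      COMMIT;

      -- Clear out the delete vectors before the next iteration.
      SELECT PURGE_TABLE('big_table');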

     

    Thanks  
  • We have had a similar scenario, and here's what we found to be the optimal approach (a sketch of one iteration follows the list):
    1. Partition the table so that the data to be updated is clustered into reasonable chunks, e.g. partition by month.
    2. Move partition from the source table to a new table using: select move_partitions_to_table('source','month_key','month_key','new_table_00x');
    3. Run updates on the single partition.
    4. Move the partition back to the primary table: select move_partitions_to_table('new_table_00x','month_key','month_key','source');
    5. Drop new table.
    6. Iterate with all partitions that need an update.
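
    A sketch of one iteration, with placeholder names for the new table, the partition key value, and the column being fixed:

      CREATE TABLE new_table_001 LIKE source INCLUDING PROJECTIONS;

      SELECT MOVE_PARTITIONS_TO_TABLE('source', '202301', '202301', 'new_table_001');

      -- The actual fix now runs against a single partition rather than billions of rows.
      UPDATE new_table_001 SET col_001 = NULL WHERE col_001 = 'NULL';
      COMMIT;

      SELECT MOVE_PARTITIONS_TO_TABLE('new_table_001', '202301', '202301', 'source');

      DROP TABLE new_table_001;
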
    /Sajan

  • This is a lifesaver! Thank you Sajan_Sangraula!!!
