Updating a multi-billion-row table with 200+ fields
Hi,
I am trying to update a multi-billion-row table, touching 200+ fields.
We tried single-column updates as well as CTAS-ing to a new table with the updated fields computed via CASE expressions. Neither method worked: the jobs ran for more than a day and we killed them.
The underlying issue is that the table contains the character string 'NULL' in these fields instead of actual NULLs.
If anyone could provide a better solution, that would be wonderful.
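This is roughly what the CTAS attempt looked like (table and column names below are placeholders, not the real schema):

CREATE TABLE big_table_new AS
SELECT
    CASE WHEN col1 = 'NULL' THEN NULL ELSE col1 END AS col1,
    CASE WHEN col2 = 'NULL' THEN NULL ELSE col2 END AS col2
    -- ...and so on for the remaining 200+ columns
FROM big_table;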
Comments
UPDATE is rather costly as it performs a DELETE, then an INSERT. Like you mentioned, you could CTAS, but you should write a CASE expression into the SELECT and partition the operation into smaller pieces to avoid a single-pass UPDATE. If you anticipate you'll be performing UPDATE statements in the future, you should try to optimize your projections for this operation.
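A minimal sketch of that idea, assuming a date column you can range over (all object names here are hypothetical; NULLIF(x, 'NULL') is just shorthand for the equivalent CASE expression):

CREATE TABLE big_table_new LIKE big_table INCLUDING PROJECTIONS;

INSERT /*+ DIRECT */ INTO big_table_new
SELECT
    NULLIF(col1, 'NULL') AS col1,
    NULLIF(col2, 'NULL') AS col2
    -- ...remaining columns, in table order
FROM big_table
WHERE date_key >= '2020-01-01' AND date_key < '2020-02-01';  -- one chunk; repeat per range

Once all chunks are loaded, you can drop the old table and rename the new one into place.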
Please provide some info about your cluster setup.
In addition, what percentage of the data needs to be updated? Do you have any partitioning set up on your table?
If the number of records that need to be updated is relatively small compared to the total, I would advise the method below:
· Insert into a temp table all the records that need to be updated, with all the CASE expressions you mentioned, so the temp table holds the corrected rows.
· Delete from the source table based on the keys that exist in the temp table.
· Insert from the temp table into the source.
You can run this method in parallel such that each thread works on a different population. As I see it, this will minimize the amount of delete vectors that Vertica creates and will optimize replay delete by the Tuple Mover (TM). Please make sure you run PURGE_TABLE after each iteration, otherwise you will see performance degradation on the next iterations.
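A sketch of a single batch of this method, with placeholder names (pk stands in for your actual key columns):

CREATE LOCAL TEMP TABLE tmp_fixed ON COMMIT PRESERVE ROWS AS
SELECT pk,
       NULLIF(col1, 'NULL') AS col1,
       NULLIF(col2, 'NULL') AS col2
       -- ...remaining columns
FROM big_table
WHERE col1 = 'NULL' OR col2 = 'NULL';  -- only rows that actually need fixing

DELETE FROM big_table WHERE pk IN (SELECT pk FROM tmp_fixed);

INSERT INTO big_table SELECT * FROM tmp_fixed;
COMMIT;

SELECT PURGE_TABLE('public.big_table');  -- clear delete vectors before the next batch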
- Partition the table so that it clusters the data to be updated into reasonable chunks, e.g. partition by month.
- Move a partition from the source table to a new table using: select move_partitions_to_table('source','month_key','month_key','new_table_00x');
- Run the updates on that single partition.
- Move the partition back to the primary table: select move_partitions_to_table('new_table_00x','month_key','month_key','source');
- Drop the new table.
- Iterate over all partitions that need an update.
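Putting those steps together, one iteration could look like the following (partition key values and object names are placeholders):

CREATE TABLE new_table_00x LIKE source INCLUDING PROJECTIONS;

SELECT MOVE_PARTITIONS_TO_TABLE('source', '2020-01', '2020-01', 'new_table_00x');

UPDATE new_table_00x
SET col1 = NULLIF(col1, 'NULL'),
    col2 = NULLIF(col2, 'NULL');  -- ...remaining columns
COMMIT;

SELECT MOVE_PARTITIONS_TO_TABLE('new_table_00x', '2020-01', '2020-01', 'source');

DROP TABLE new_table_00x;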
/Sajan

This is a life saver! Thank you Sajan_Sangraula!!!