
Updating multi billion row table with 200+ fields

Hi,

I am trying to update a multi-billion-row table, with changes to 200+ fields.

We tried single-column UPDATEs as well as CTAS-ing to a new table with the updated fields computed via CASE expressions. Neither approach worked: the jobs ran for more than a day and we killed them.

The issue is that the table contains the character string 'NULL' in many fields instead of actual NULLs.

If anyone could suggest a better solution, that would be wonderful.

Comments

  • UPDATE is rather costly in Vertica because it performs a DELETE followed by an INSERT. As you mentioned, you could CTAS, but you should write a CASE (or NULLIF) expression into the SELECT, and partition the operation into smaller pieces to avoid a single-pass UPDATE. If you anticipate running UPDATE statements in the future, you should try to optimize your projections for that workload.
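
    A minimal sketch of that CTAS, assuming a hypothetical table big_table with a key column id and the literal string 'NULL' stored in its character columns; the column list and the new table's physical design are placeholders you would take from the real table definition:

        -- Rebuild the table in one pass, turning the literal string 'NULL' into real NULLs.
        CREATE TABLE big_table_fixed AS
        SELECT
            id,
            NULLIF(col1, 'NULL') AS col1,
            NULLIF(col2, 'NULL') AS col2
            -- ... repeat NULLIF(colN, 'NULL') for the remaining character columns
        FROM big_table;

        -- Once the copy is verified, swap the names and keep the old table as a fallback.
        ALTER TABLE big_table RENAME TO big_table_old;
        ALTER TABLE big_table_fixed RENAME TO big_table;
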
  • Hi,

    Please provide some info about your cluster setup.
    Also, what percentage of the data needs to be updated? Do you have any partitioning set up on the table?

  • If the number of records that need to be updated is relatively small compared to the total, I would advise the method below:

    ·       Insert into a temp table all the records that need to be updated, applying the CASE expressions you mentioned, so the temp table holds the corrected rows.

    ·       Delete from the source table based on the keys that exist in the temp table.

    ·       Insert from the temp table back into the source table.

    You can run this method in parallel, with each thread working on a different slice of the data. That minimizes the number of delete vectors Vertica creates and makes replay deletes by the Tuple Mover cheaper. Please make sure you run PURGE_TABLE after each iteration, otherwise you will see performance degradation in the following iterations. A sketch of one iteration is shown below.
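
    A minimal sketch of one iteration, assuming a hypothetical source table big_table with primary key id and a predicate some_filter that selects the rows needing the fix (all names are placeholders):

        -- 1. Stage the corrected rows in a session-scoped temp table.
        CREATE LOCAL TEMPORARY TABLE fix_batch ON COMMIT PRESERVE ROWS AS
        SELECT
            id,
            NULLIF(col1, 'NULL') AS col1,
            NULLIF(col2, 'NULL') AS col2
            -- ... remaining columns
        FROM big_table
        WHERE some_filter;

        -- 2. Remove the originals from the source table.
        DELETE FROM big_table
        WHERE id IN (SELECT id FROM fix_batch);

        -- 3. Re-insert the corrected rows and commit.
        INSERT INTO big_table SELECT * FROM fix_batch;
        COMMIT;

        -- 4. Purge the delete vectors before starting the next iteration.
        SELECT PURGE_TABLE('big_table');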

     

    Thanks  
  • We had a similar scenario, and here's what we found to be the optimal approach (a sketch of one iteration follows the list):
    1. Partition the table so that the data to be updated is clustered into reasonable chunks, e.g. partition by month.
    2. Move one partition from the source table to a new table using: select move_partitions_to_table('source','month_key','month_key','new_table_00x');
    3. Run the updates on that single partition.
    4. Move the partition back to the primary table: select move_partitions_to_table('new_table_00x','month_key','month_key','source');
    5. Drop the new table.
    6. Iterate over all partitions that need an update.
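
    A sketch of steps 2 through 5 for one partition, assuming the table is partitioned by a month_key column and 202301 is the partition being processed (table and column names are placeholders):

        -- 2. Move one partition out to a scratch table with an identical definition.
        CREATE TABLE new_table_001 LIKE source INCLUDING PROJECTIONS;
        SELECT MOVE_PARTITIONS_TO_TABLE('source', '202301', '202301', 'new_table_001');

        -- 3. Run the updates against the much smaller scratch table.
        UPDATE new_table_001
        SET    col1 = NULLIF(col1, 'NULL')
               -- , col2 = NULLIF(col2, 'NULL'), ...
        ;
        COMMIT;

        -- 4. Move the partition back into the primary table.
        SELECT MOVE_PARTITIONS_TO_TABLE('new_table_001', '202301', '202301', 'source');

        -- 5. Drop the scratch table, then repeat for the next partition.
        DROP TABLE new_table_001;
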
    /Sajan

  • This is a life saver! Thank you Sajan_Sangraula!!!
