Updating multi billion row table with 200+ fields

kannan_1kannan_1 Registered User
Hi,

I am trying to update a multi billion row table with updates for 200+ fields.

We tried to do single column updates as well as CTAS-ing to a new table with updated field using case. Both the methods are not working. I mean the jobs ran for more than a day and we killed it.

The issue we have is that, the table contains fields with character NULL data , instead of actual NULL.

If any one coould provide a better solution that would be wonderful.

Comments

  • norbert_krupanorbert_krupa Registered User
    Test
  • norbert_krupanorbert_krupa Registered User
    UPDATE is rather costly as it performs a DELETE, then INSERT. Like you mentioned, you could CTAS, but you should write a CASE statement into the SELECT, and partition the operation into smaller pieces to avoid a single pass UPDATE. If you anticipate you'll be performing UPDATE statements in the future, you should try to optimize your projections for this operation.
  • eli_revacheli_revach Registered User
    Hi ,

    Please provide some info about your cluster setup . 
    In addition , what is the percentage of  data that need to be update ?    do you have any partition setup on your table ?

  • eli_revacheli_revach Registered User

    If your amounts of records that need to be update is relatively small comparing to the total’s that you have  , I will advise the below method :

     

    ·       Insert into temp table all the records that need to be update  with all the case expression you mention , so on target table you will have the updated rows.

    ·       Delete from source base on the keys that exists on the temp table .

    ·       Insert from temp to source .

     

    You can run this method in parallel such each thread will work on different population . As I see it will minimize the about of delete vectors  that Vertica will create + will optimize replay delete by TM , please make sure you run purge_table after each iteration otherwise you will performance degradation   on the next iterations .

     

    Thanks  
  • Sajan_SangraulaSajan_Sangraula Registered User
    We have had a similar scenario, and here's what we found to be the optimal approach:
    1. Partition the table that clusters the data to be updated into reasonable chunks. e.g. partition by month
    2. Move partition from the source table to a new table using: select move_partitions_to_table('source','month_key','month_key','new_table_00x');
    3. Run updates on the single partition.
    4. Move the partition back to the primary table: select move_partitions_to_table('new_table_00x','month_key','month_key','source');
    5. Drop new table.
    6. Iterate with all partitions that need an update.
    /Sajan

  • abtsantosbentenabtsantosbenten Registered User

    This is a life saver! Thank you Sajan_Sangraula!!!

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file