Please take this survey to help us learn more about how you use third party tools. Your input is greatly appreciated!

Deleting duplicates from a table

I have large tables which have duplicate records. I know we can delete duplicates by creating a temporary table and inserting distinct record there and then dropping the original table and renaming the temp table to original table.
But I have a service which may be writing to the original table via copy command with auto commit when would be performing this drop and rename approach. What will happen in such case? My application which writes data to the table can retry, but when I do a insert /* +direct+*/ into temp select distinct,... the app could be writing to the main table. The insert query would have inserted some data but meanwhile that app might have written new data to the main table, If i drop the main table and rename temp to main table, will there be any data loss? What would be the best practice to remove duplicates from a table where other application could still be writing to the same table.

Answers

  • You could try partitioning the table - by the element of time - and then

    CREATE TABLE my_copy LIKE my_orig INCLUDING PROJECTIONS;
    INSERT INTO my_copy SELECT DISTINCT * FROM my_orig WHERE my_partition_key='<whatever_it_is>';
    SELECT SWAP_PARTITIONS_BETWEEN_TABLES(
      'my_copy'
    . '<whatever_the_smallest_partition_key_is>'
    . '<whatever_the_greatest_partition_key_is>'
    , 'my_orig'
    );
    

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file

Can't find what you're looking for? Search the Vertica Documentation, Knowledge Base, or Blog for more information.