Deleting duplicates from a table

sverma28 · May 2020

I have large tables which have duplicate records. I know we can delete duplicates by creating a temporary table and inserting distinct record there and then dropping the original table and renaming the temp table to original table.
But I have a service which may be writing to the original table via copy command with auto commit when would be performing this drop and rename approach. What will happen in such case? My application which writes data to the table can retry, but when I do a insert /* +direct+*/ into temp select distinct,... the app could be writing to the main table. The insert query would have inserted some data but meanwhile that app might have written new data to the main table, If i drop the main table and rename temp to main table, will there be any data loss? What would be the best practice to remove duplicates from a table where other application could still be writing to the same table.

marcothesane · May 2020

You could try partitioning the table - by the element of time - and then

CREATE TABLE my_copy LIKE my_orig INCLUDING PROJECTIONS;
INSERT INTO my_copy SELECT DISTINCT * FROM my_orig WHERE my_partition_key='<whatever_it_is>';
SELECT SWAP_PARTITIONS_BETWEEN_TABLES(
  'my_copy'
. '<whatever_the_smallest_partition_key_is>'
. '<whatever_the_greatest_partition_key_is>'
, 'my_orig'
);

We're Moving!

Create My New Community Account Now

Deleting duplicates from a table

Answers

Leave a Comment