Yet Another Way to De-Duplicate Table Rows

Jim_Knicely · February 2019

To remove duplicate table rows it is typically better (i.e. faster) to create a temporary table that contains all of the distinct rows from a table, drop the original table, then rename the temp table to the original table’s name.

Example:

dbadmin=> SELECT * FROM dups;
c1 | c2
----+----
  1 | A
  1 | A
  1 | A
  2 | B
  3 | C
  3 | C
  4 | D
(7 rows)

dbadmin=> CREATE TABLE dups_new LIKE dups INCLUDING PROJECTIONS;
CREATE TABLE

dbadmin=> INSERT /*+ DIRECT */ INTO dups_new SELECT DISTINCT * FROM dups;
OUTPUT
--------
      4
(1 row)

dbadmin=> DROP TABLE dups;
DROP TABLE

dbadmin=> ALTER TABLE dups_new RENAME TO dups;
ALTER TABLE

dbadmin=> SELECT * FROM dups;
c1 | c2
----+----
  1 | A
  2 | B
  3 | C
  4 | D
(4 rows)

The issue with that solution is that you’ll need to be sure that the original table grants are restored if they exist.

For smaller tables that have duplicate rows, here is another method to remove them that doesn’t involve creating a new table.

dbadmin=> SELECT * FROM dups2;
c1 | c2
----+----
  1 | A
  1 | A
  1 | A
  2 | B
  3 | C
  3 | C
  4 | D
(7 rows)

dbadmin=> INSERT /*+ DIRECT */ INTO dups2 SELECT DISTINCT c1, c2 FROM dups;
OUTPUT
--------
      4
(1 row)

dbadmin=> DELETE /*+ DIRECT */ FROM dups2 WHERE epoch IS NOT NULL;
OUTPUT
--------
      7
(1 row)

dbadmin=> SELECT * FROM dups2;
c1 | c2
----+----
  1 | A
  2 | B
  3 | C
  4 | D
(4 rows)

dbadmin=> COMMIT;
COMMIT

This method works because the hidden table EPOCH column is NULL for each row inserted until you issue a COMMIT statement.

Helpful Link:
https://www.vertica.com/kb/Understanding-Vertica-Epochs/Content/BestPractices/Understanding-Vertica-Epochs.htm

Have fun!

mkheir · February 2019

Very nice !!

Many thanks Jim.

Yet Another Way to De-Duplicate Table Rows

Comments