Deleting Duplicates based on three columns

I have the following script in Oracle:

    DELETE FROM t WHERE rowid NOT IN (SELECT MAX(rowid) FROM t GROUP BY c1, TRIM(c2), TRIM(c3));

I need to apply this in Vertica. As there is no rowid concept in Vertica, how can I achieve this?

Comments

  • If you need each row in your table to have a unique identifier, in Vertica you do that explicitly -- you could add a "rowid" IDENTITY column to your table definition; then the above query would work. Unfortunately, it is not straightforward to add an IDENTITY column to an existing table in Vertica; but you can achieve the same effect by adding a sequence, then adding an INTEGER column with a default value populated by the sequence (see the sketch below). (Note that, unlike in Oracle, since Vertica is a column-store database, extra columns have zero performance impact on queries that don't access those columns.)
    I should also give our standard warning for people who ask about DELETE -- DELETE in Vertica has a dramatically bigger (negative) performance impact on the whole system than in Oracle or other unsorted / row-store products. (You'll see slower queries on tables with deleted data; also much slower recovery if a node goes down. And, of course, DELETE itself isn't exactly super-fast.) There are a variety of alternative (much faster) techniques discussed elsewhere in these forums; feel free to look around or to post a question if you're finding that processes tuned for Oracle aren't performing as well as you'd like in Vertica.
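    A minimal sketch of the sequence approach (the sequence and column names are illustrative; depending on your Vertica version you may need to add the column and set its default in separate steps):

      CREATE SEQUENCE t_row_seq;

      -- Populate an integer column from the sequence so every row
      -- gets a distinct value, emulating Oracle's rowid:
      ALTER TABLE t ADD COLUMN row_id INT DEFAULT NEXTVAL('t_row_seq');

      -- The Oracle-style de-duplication then works as written:
      DELETE FROM t
      WHERE row_id NOT IN (SELECT MAX(row_id) FROM t
                           GROUP BY c1, TRIM(c2), TRIM(c3));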
  • Incidentally, speaking of alternative ways of looking at this problem, how about some variant of: CREATE TABLE new_table AS SELECT DISTINCT * FROM t; Then swap the new table in for the old (see the sketch below). I think you'll actually find that this is much faster than the DELETE approach, if you're removing any significant number of duplicates. And it doesn't require creating your own "id" column or similar.
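    A minimal sketch of the swap (the table names are illustrative):

      CREATE TABLE t_dedup AS SELECT DISTINCT * FROM t;

      -- Vertica can rename several tables in one atomic statement,
      -- which swaps the de-duplicated copy in for the original:
      ALTER TABLE t, t_dedup RENAME TO t_old, t;

      -- Once satisfied, drop the old copy:
      DROP TABLE t_old;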
  • But this is an everyday process -- I need to find the duplicates and remove them every day. This query runs daily.
  • In that case, what's introducing the duplicates? If you load a new block of data every day, you could look into an ELT process -- load the data into a temporary table, then do an "INSERT INTO t SELECT DISTINCT * FROM tmp_table;". Or even use an external table rather than a temporary table, so you only do one pass over the data (see the sketch below).
    If instead you have a process that inherently introduces duplicates and you're trying to clean them up regularly, my suggestion would be to see if you can somehow avoid creating the duplicate records in the first place. In some sense, creating duplicates is doing more work in order to create yet more work for yourself :-) Though I realize that this can be hard to avoid in some cases.
    One option that you could play with -- keep the table with duplicates, and put a view that does the "SELECT DISTINCT *" on top of it. The view will always be 100% distinct, up to the minute/second/etc. Because we always store our data sorted (unlike many other databases), with the right projection design duplicate rows will always be stored adjacent to each other and so are easy to identify; the runtime de-duplication would be much, much cheaper than you're likely used to from other systems. This might let you get away with de-duplicating the data on disk much less frequently, so you could afford to do something a little more expensive.
    That last case might not work well for your particular use case. I'm posting it mostly as an example, for you and for others reading this: Vertica and other systems all support SQL; you can do the same operations, run the same scripts, etc., in any of those systems. But Vertica thinks, so to speak, about problems a little differently than other systems; so you can find that approaches that are extremely slow in other databases actually do quite well in Vertica, and vice versa. Either approach will still work, as long as it's SQL; it's all about performance.
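    A minimal sketch of the ELT idea and the view idea (the staging-table name and file path are illustrative):

      -- Empty staging table with t's column layout:
      CREATE LOCAL TEMPORARY TABLE tmp_load ON COMMIT PRESERVE ROWS AS
          SELECT * FROM t LIMIT 0;

      COPY tmp_load FROM '/data/daily.csv' DELIMITER ',';

      -- De-duplicate the batch on its way into the target table:
      INSERT INTO t SELECT DISTINCT * FROM tmp_load;

      -- Or keep duplicates on disk and expose an always-distinct view:
      CREATE VIEW t_distinct AS SELECT DISTINCT * FROM t;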
  • Hi!
    1. You can use ROW_NUMBER() OVER ([PARTITION BY ...] [ORDER BY ...]) in a sub-query -- it numbers the rows within each partition, a concept similar to Oracle's ROWID (you can also define an ORDER BY, recommended for DELETE optimization and stable results).
    2. You can add a surrogate key (easily done in Vertica: add a column with default value HASH(c1, TRIM(c2), TRIM(c3)); it will serve as a surrogate key). A sketch follows the examples below.
    3. Check whether MERGE fits your requirements (instead of deleting duplicates, with MERGE you can update matched rows).
  • Something like the following (it's just an example; I don't know your data):
      daniel=> select * from t;
       c1 | c2 | c3 | c4
      ----+----+----+----
        1 |  1 |  1 | a
        1 |  1 |  1 | b
        1 |  1 |  1 | c
        2 |  2 |  2 | a
        2 |  2 |  2 | b
        2 |  2 |  2 | c
      (6 rows)

      daniel=> select c1,c2,c3,c4,row_number() over (partition by c1,c2,c3) rowid from t;
       c1 | c2 | c3 | c4 | rowid
      ----+----+----+----+-------
        1 |  1 |  1 | a  |     1
        1 |  1 |  1 | b  |     2
        1 |  1 |  1 | c  |     3
        2 |  2 |  2 | a  |     1
        2 |  2 |  2 | b  |     2
        2 |  2 |  2 | c  |     3
      (6 rows)

      daniel=> select c1,c2,c3,c4 from (select c1,c2,c3,c4,row_number() over (partition by c1,c2,c3) rowid from t) tbl where rowid = 3;
       c1 | c2 | c3 | c4
      ----+----+----+----
        1 |  1 |  1 | c
        2 |  2 |  2 | c
      (2 rows)

      daniel=> select c1,c2,c3,c4 from (select c1,c2,c3,c4,row_number() over (partition by c1,c2,c3) rowid from t) tbl where rowid = (select max(rowid) from (select row_number() over (partition by c1,c2,c3) rowid from t) tbl1);
       c1 | c2 | c3 | c4
      ----+----+----+----
        1 |  1 |  1 | c
        2 |  2 |  2 | c
      (2 rows)

      daniel=> select c1,c2,c3,c4 from (select c1,c2,c3,c4,row_number() over (partition by c1,c2,c3) rowid from t) tbl where rowid = (select max(cnt) from (select count(*) as cnt from t group by c1,c2,c3) tbl);
       c1 | c2 | c3 | c4
      ----+----+----+----
        1 |  1 |  1 | c
        2 |  2 |  2 | c
      (2 rows)
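    A minimal sketch of suggestion 2 above (the column name dup_key is illustrative):

      -- HASH() is a built-in Vertica function; duplicate rows get the
      -- same value, so this key identifies duplicate groups rather than
      -- individual rows (useful, e.g., as a MERGE join key):
      ALTER TABLE t ADD COLUMN dup_key INT
          DEFAULT HASH(c1, TRIM(c2), TRIM(c3));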
  • Navin_C (Vertica Customer)
    Hello, maybe this method can help you. Suppose you have a table like this:
       id | name  | sal   | location
      ----+-------+-------+----------
        1 | Tom   | 50000 | NJ
        1 | Tom   | 50000 | NJ
        2 | Jerry | 40000 | NJ
        2 | jerry | 40000 | NJ
        3 | Tim   | 40000 | NJ
        4 | Jim   | 20000 | NJ
    You can do something like this. First, create a temporary table with a single instance of each record, so no duplicates exist:
      CREATE TEMPORARY TABLE Loc_temp_test ON COMMIT PRESERVE ROWS AS
      SELECT
          id
        , name
        , sal
        , location
      FROM (
          SELECT
              id
            , name
            , sal
            , location
            -- the original was missing this comma before ROW_NUMBER():
            , ROW_NUMBER() OVER (PARTITION BY name, location, sal
                                 ORDER BY id ASC) AS rnk
          FROM test
      ) sub
      WHERE sub.rnk = 1;
    Then truncate the existing table test:
       TRUNCATE TABLE test;  
    Then load table test with the data from the temporary table created earlier:
      INSERT INTO test
      SELECT * FROM Loc_temp_test;
    This way you can load your target table with all unique records, removing all duplicates. There are other ways of removing duplicates from a table as well. Hope this helps.
  • Any rule-of-thumb guidelines for when we should DELETE rows and when we should create table then swap?
  • Something like:
    Query performance after large deletes
    A large number of (unpurged) deleted rows can negatively affect query performance. To eliminate rows that have been deleted from the result, a query must do extra processing. If 10% or more of the total rows in a table have been deleted, the performance of a query on the table degrades. However, your experience may vary depending on the size of the table, the table definition, and the query. If a table has a large number of deleted rows, consider purging those rows to improve performance. For more information on purging, see Purging Deleted Data.
    So the rule is about 10% of the data (as noted, this number "may vary depending on the size of the table"). The best way: do a simple test to find the point at which the "SELECT DISTINCT"/swap strategy becomes faster than DELETE/PURGE (a purge sketch is below).
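    For completeness, a minimal sketch of purging one table's deleted rows (the schema-qualified name 'public.t' is illustrative):

      -- PURGE_TABLE permanently removes deleted rows from storage:
      SELECT PURGE_TABLE('public.t');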
  • Thanks, Daniel -- and thanks also for the speed of your reply. This helps, though I was wondering more about the speed of the individual daily deduplications (as in the original question). When should we SELECT DISTINCT/swap instead of deleting a few rows -- when the tables are some orders of magnitude larger than the number of deleted rows?
  • DELETEs are a complicated topic, and I'm not sure I can give a good answer.
  • Thanks for the reply Dan
  • Thank you so much, everyone. I will try all your suggestions; as I am in a development phase, I can try everything and find out which fits my ETL. Thanks so much!
