INSERT IGNORE equivalent for Vertica?
Our data input stream contains many duplicates that need to be filtered out at the database layer.
We are using a staging table without constraints and a deduplicated table which is periodically created via
```sql
CREATE TABLE IF NOT EXISTS tableName_dedup AS
SELECT column1, column2, ...
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY column1, column2) AS rownum
  FROM tableName_import
) import
WHERE import.rownum = 1;
```
This, however, takes forever because the SELECT performs poorly. It would be much faster to simply ignore the duplicates and insert only the first occurrence.
I can't seem to get it to work via a MERGE, because there is no "ON MATCHED IGNORE" or similar.
Is there a way to do this?
Comments
Hi
You can just omit the WHEN MATCHED THEN UPDATE clause, like so:
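A sketch of what that MERGE could look like, reusing the staging and dedup table names from the question; the join keys are illustrative:

```sql
-- With no WHEN MATCHED clause, rows that already exist in the target
-- are simply left alone; only unmatched rows are inserted.
MERGE INTO tableName_dedup tgt
USING tableName_import src
   ON src.column1 = tgt.column1
  AND src.column2 = tgt.column2
WHEN NOT MATCHED THEN
  INSERT (column1, column2) VALUES (src.column1, src.column2);
```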
And if you want to use OLAP functions, try the analytic LIMIT clause instead.
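A sketch of the analytic LIMIT form, again with the illustrative column names from the question:

```sql
-- Keeps exactly one row per (column1, column2) group.
INSERT INTO tableName_dedup
SELECT * FROM tableName_import
LIMIT 1 OVER (PARTITION BY column1, column2 ORDER BY column1);
```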
Pretty efficient ...
Good luck -
I don't think the MERGE command will work if the dups span all of the columns in the table:
```
dbadmin=> SELECT * FROM has_dups;
 c1 | c2
----+----
  2 | B
  2 | B
  3 | C
  1 | A
  1 | A
  1 | A
(6 rows)

dbadmin=> MERGE /*+ DIRECT */ INTO no_dups USING has_dups ON has_dups.c1 = no_dups.c1 AND has_dups.c2 = no_dups.c2
dbadmin->   WHEN NOT MATCHED THEN INSERT (c1, c2) VALUES (has_dups.c1, has_dups.c2);
 OUTPUT
--------
      6
(1 row)

dbadmin=> SELECT * FROM no_dups;
 c1 | c2
----+----
  2 | B
  2 | B
  3 | C
  1 | A
  1 | A
  1 | A
(6 rows)

dbadmin=> ROLLBACK;
ROLLBACK

dbadmin=> SELECT * FROM no_dups;
 c1 | c2
----+----
(0 rows)
```

Per @marcothesane's recommendation, the LIMIT analytic function is the way to go!
```
dbadmin=> INSERT /*+ DIRECT */ INTO no_dups SELECT * FROM has_dups LIMIT 1 OVER (PARTITION BY c1, c2 ORDER BY 1);
 OUTPUT
--------
      3
(1 row)

dbadmin=> SELECT * FROM no_dups;
 c1 | c2
----+----
  1 | A
  2 | B
  3 | C
(3 rows)
```

The LIMIT function does not yield better performance, sadly. The import table has a VARCHAR(50000) column which apparently causes the slowdown. When I remove this column and run the query again, it finishes in a matter of seconds. Does this maybe give you a hint on how to improve performance?
Thank you!!
Ha. Yeah, big VARCHARs tend to be a problem in an analytic database. Is the data in that column using all 50K? If so, a text index might help, although I am not sure a text index would help when using the LIMIT function. I know they're a huge help for WHERE clause predicates. Maybe give it a whirl?
See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/TextSearch/TextSearchConceptual.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/CREATETEXTINDEX.htm
If that doesn't help, do you need to look at the entire string to determine duplicates? That is, maybe you can check just the first 50 or so characters using the SUBSTR or LEFT functions?
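A sketch of that prefix idea, deduplicating on only the first 50 characters of the wide column; the column names are illustrative:

```sql
-- Partition on a short prefix of the wide column instead of its full value,
-- so the analytic LIMIT never has to compare the whole 50K string.
INSERT INTO tableName_dedup
SELECT * FROM tableName_import
LIMIT 1 OVER (PARTITION BY column1, LEFT(very_big_column, 50) ORDER BY column1);
```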
@Jim_Knicely Thanks for your answer. I forgot to mention that the big column is not even used when determining whether rows are duplicates. To be more exact:
I copied the table via the copy_table command and deleted the very_big_column, and afterwards the CREATE TABLE ... AS SELECT ran in roughly 2 seconds instead of up to 10 minutes. The very_big_column has VARCHARs with lengths up to 20k, but mostly < 1k.
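The copy-then-drop workaround described above could be sketched like this, using Vertica's COPY_TABLE meta-function; the slim table name is illustrative, and this assumes the dropped column is not pinned by a projection that would block the DROP:

```sql
-- Clone the staging table, then drop the wide column from the clone
-- before running the deduplicating CREATE TABLE ... AS SELECT on it.
SELECT COPY_TABLE('tableName_import', 'tableName_import_slim');
ALTER TABLE tableName_import_slim DROP COLUMN very_big_column;
```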