INSERT IGNORE equivalent for Vertica?
Our data input stream has a lot of duplicates that need to be filtered on the database leayer.
We are using a staging table without constraints and a deduplicated table which is periodically created via
CREATE TABLE IF NOT EXISTS tableName_dedup AS SELECT column1, column2, ... FROM (SELECT *, row_number OVER (PARTITION BY column1, column2) as rownum FROM tableName_import) import where import.rownum = 1;
This however takes forever, because the SELECT has poor performance. It would be much faster to simply ignore the duplicates and only insert the first one.
I can't seem to get it to work via a MERGE, because there is no "ON MATCHED IGNORE" or similar.
Is there a way to do this?
0
Comments
Hi
You can just omit the WHEN MATCHED THEN UPDATE clause, like so:
And if you want to use OLAP functions - then try the analytic LIMIT clause, which goes:
;
Pretty efficient ...
Good luck -
I dont think the MERGE command will work if the dups include all of the columns in the table
Per @marcothesane recommnedation, the LIMIT analytic function is the way to go!
The
LIMIT
function does not yield better performance sadly.The
import
table has aVARCHAR(50000)
column which apparently causes the slowdown. When i remove this column and run the query again it finishes in a matter of seconds.Does this maybe give you a hint on how to improve performance?
Thank you!!
Ha. Yeah, big VARCHARs tend to be a problem in an analytic database. Is the data in that column using all 50K? If so, a text index might help. Although I am not sure a text index would help when using the LIMIT function. I know they're a huge help for WHERE clause predicates.Maybe give it a whirl?
See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/TextSearch/TextSearchConceptual.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/CREATETEXTINDEX.htm
If that doesn't help, do you need to look at the entire string to determine duplicates? That is, maybe you can check just the first 50 or so characters using the SUBSTR or LEFT functions?
@Jim_Knicely Thanks for your answer. I forgot to write, that the big column is not even used when determining if columns are duplicates. So to be more exact:
I copied the table via
copy_table
command and deleted thevery_big_column
and afterwards theCREATE TABLE ... AS SELECT
ran in roughly 2 seconds instead of up to 10 minutes.The
very_big_column
hasVARCHARS
with lengths up to 20k, but mostly < 1k.