INSERT IGNORE equivalent for Vertica?

wederer_intence · June 2019

Our data input stream has a lot of duplicates that need to be filtered on the database leayer.
We are using a staging table without constraints and a deduplicated table which is periodically created via

CREATE TABLE IF NOT EXISTS tableName_dedup
AS SELECT column1, column2, ...
FROM (SELECT *, row_number OVER (PARTITION BY column1, column2)
as rownum FROM tableName_import) import
where import.rownum = 1;

This however takes forever, because the SELECT has poor performance. It would be much faster to simply ignore the duplicates and only insert the first one.
I can't seem to get it to work via a MERGE, because there is no "ON MATCHED IGNORE" or similar.

Is there a way to do this?

marcothesane · June 2019

Hi
You can just omit the WHEN MATCHED THEN UPDATE clause, like so:

MERGE /*+DIRECT*/ 
INTO  d_teas_scd  t 
USING d_teas_stg  s
   ON s.tea_key = t.tea_key
WHEN NOT matched THEN INSERT (
  tea_key
, tea_id
, tea_eff_dt
, tea_end_dt
, tea_name
, tea_price
)
VALUES (
  s.tea_key
, s.tea_id
, s.tea_eff_dt
, s.tea_end_dt
, s.tea_name
, s.tea_price
);

And if you want to use OLAP functions - then try the analytic LIMIT clause, which goes:

SELECT
  *
FROM d_teas_scd
LIMIT 1 OVER(PARTITION BY tea_id ORDER BY tea_eff_dt DESC)

;

Pretty efficient ...

Good luck -

Jim_Knicely · June 2019

I dont think the MERGE command will work if the dups include all of the columns in the table

dbadmin=> SELECT * FROM has_dups;
 c1 | c2
----+----
  2 | B
  2 | B
  3 | C
  1 | A
  1 | A
  1 | A
(6 rows)

dbadmin=> MERGE /*+ DIRECT */ INTO no_dups USING has_dups ON has_dups.c1 = no_dups.c1 AND has_dups.c2 = no_dups.c2
dbadmin->   WHEN NOT MATCHED THEN INSERT (c1, c2) VALUES (has_dups.c1, has_dups.c2);
 OUTPUT
--------
      6
(1 row)

dbadmin=> SELECT * FROM no_dups;
 c1 | c2
----+----
  2 | B
  2 | B
  3 | C
  1 | A
  1 | A
  1 | A
(6 rows)

dbadmin=> rollback;
ROLLBACK

dbadmin=> SELECT * FROM no_dups;
 c1 | c2
----+----
(0 rows)

Per @marcothesane recommnedation, the LIMIT analytic function is the way to go!

dbadmin=> INSERT /*+ DIRECT */ INTO no_dups SELECT * FROM has_dups LIMIT 1 OVER (PARTITION BY c1, c2 ORDER BY 1);
 OUTPUT
--------
      3
(1 row)

dbadmin=> SELECT * FROM no_dups;
 c1 | c2
----+----
  1 | A
  2 | B
  3 | C
(3 rows)

wederer_intence · June 2019

The LIMIT function does not yield better performance sadly.

The import table has a VARCHAR(50000) column which apparently causes the slowdown. When i remove this column and run the query again it finishes in a matter of seconds.
Does this maybe give you a hint on how to improve performance?

Thank you!!

Jim_Knicely · June 2019

Ha. Yeah, big VARCHARs tend to be a problem in an analytic database. Is the data in that column using all 50K? If so, a text index might help. Although I am not sure a text index would help when using the LIMIT function. I know they're a huge help for WHERE clause predicates.Maybe give it a whirl?

See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/TextSearch/TextSearchConceptual.htm
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/CREATETEXTINDEX.htm

If that doesn't help, do you need to look at the entire string to determine duplicates? That is, maybe you can check just the first 50 or so characters using the SUBSTR or LEFT functions?

wederer_intence · June 2019

@Jim_Knicely Thanks for your answer. I forgot to write, that the big column is not even used when determining if columns are duplicates. So to be more exact:

CREATE TABLE IF NOT EXISTS tableName_dedup
AS SELECT small_column1, small_column2, very_big_column
FROM (SELECT *, row_number OVER (PARTITION BY column1, column2)
as rownum FROM tableName_import) import
where import.rownum = 1;

I copied the table via copy_table command and deleted the very_big_column and afterwards the CREATE TABLE ... AS SELECT ran in roughly 2 seconds instead of up to 10 minutes.
The very_big_column has VARCHARS with lengths up to 20k, but mostly < 1k.

We're Moving!

Create My New Community Account Now

INSERT IGNORE equivalent for Vertica?

Comments

Leave a Comment