INSERT IGNORE equivalent for Vertica?

Our data input stream contains many duplicates that need to be filtered out at the database layer.
We are using a staging table without constraints and a deduplicated table which is periodically created via

CREATE TABLE IF NOT EXISTS tableName_dedup
AS SELECT column1, column2, ...
FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY column1, column2)
AS rownum FROM tableName_import) import
WHERE import.rownum = 1;

This, however, takes forever because the SELECT performs poorly. It would be much faster to simply ignore the duplicates and insert only the first occurrence.
I can't seem to get this to work via MERGE, because there is no "ON MATCHED IGNORE" clause or similar.

Is there a way to do this?

Comments

  • marcothesane - Administrator

    Hi
    You can just omit the WHEN MATCHED THEN UPDATE clause, like so:

    MERGE /*+DIRECT*/ 
    INTO  d_teas_scd  t 
    USING d_teas_stg  s
       ON s.tea_key = t.tea_key
    WHEN NOT MATCHED THEN INSERT (
      tea_key
    , tea_id
    , tea_eff_dt
    , tea_end_dt
    , tea_name
    , tea_price
    )
    VALUES (
      s.tea_key
    , s.tea_id
    , s.tea_eff_dt
    , s.tea_end_dt
    , s.tea_name
    , s.tea_price
    );
    

    And if you want to use OLAP functions - then try the analytic LIMIT clause, which goes:

    SELECT
      *
    FROM d_teas_scd
    LIMIT 1 OVER (PARTITION BY tea_id ORDER BY tea_eff_dt DESC);

    Pretty efficient ...

    Good luck -

  • Jim_Knicely - Administrator
    edited June 2019

    I don't think the MERGE command will work if the dups include all of the columns in the table :)

    dbadmin=> SELECT * FROM has_dups;
     c1 | c2
    ----+----
      2 | B
      2 | B
      3 | C
      1 | A
      1 | A
      1 | A
    (6 rows)
    
    dbadmin=> MERGE /*+ DIRECT */ INTO no_dups USING has_dups ON has_dups.c1 = no_dups.c1 AND has_dups.c2 = no_dups.c2
    dbadmin->   WHEN NOT MATCHED THEN INSERT (c1, c2) VALUES (has_dups.c1, has_dups.c2);
     OUTPUT
    --------
          6
    (1 row)
    
    dbadmin=> SELECT * FROM no_dups;
     c1 | c2
    ----+----
      2 | B
      2 | B
      3 | C
      1 | A
      1 | A
      1 | A
    (6 rows)
    
    dbadmin=> rollback;
    ROLLBACK
    
    dbadmin=> SELECT * FROM no_dups;
     c1 | c2
    ----+----
    (0 rows)
    

    Per @marcothesane's recommendation, the LIMIT analytic function is the way to go!

    dbadmin=> INSERT /*+ DIRECT */ INTO no_dups SELECT * FROM has_dups LIMIT 1 OVER (PARTITION BY c1, c2 ORDER BY 1);
     OUTPUT
    --------
          3
    (1 row)
    
    dbadmin=> SELECT * FROM no_dups;
     c1 | c2
    ----+----
      1 | A
      2 | B
      3 | C
    (3 rows)
    
  • The LIMIT function does not yield better performance, sadly.

    The import table has a VARCHAR(50000) column which apparently causes the slowdown. When I remove this column and run the query again, it finishes in a matter of seconds.
    Does this maybe give you a hint on how to improve performance?

    Thank you!!

  • Jim_Knicely - Administrator
    edited June 2019

    Ha. Yeah, big VARCHARs tend to be a problem in an analytic database. Is the data in that column using all 50K? If so, a text index might help. Although I am not sure a text index would help when using the LIMIT function; I know they're a huge help for WHERE clause predicates. Maybe give it a whirl?

    See:
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Tables/TextSearch/TextSearchConceptual.htm
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Statements/CREATETEXTINDEX.htm

    If that doesn't help, do you need to compare the entire string to determine duplicates? That is, maybe you can check just the first 50 or so characters using the SUBSTR or LEFT functions?
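
    A sketch of that prefix idea, using the generic table and column names from the original post (the names and the 50-character cutoff are placeholders, not a tested recommendation):

    ```sql
    -- Dedupe on a prefix of the wide column instead of the full string.
    -- LEFT(very_big_column, 50) appears in the PARTITION BY, so rows that
    -- differ only beyond character 50 are treated as duplicates of each other.
    INSERT /*+ DIRECT */ INTO tableName_dedup
    SELECT *
    FROM tableName_import
    LIMIT 1 OVER (PARTITION BY column1, column2, LEFT(very_big_column, 50) ORDER BY 1);
    ```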

  • @Jim_Knicely Thanks for your answer. I forgot to mention that the big column is not even used when determining whether rows are duplicates. To be more exact:

    CREATE TABLE IF NOT EXISTS tableName_dedup
    AS SELECT small_column1, small_column2, very_big_column
    FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY small_column1, small_column2)
    AS rownum FROM tableName_import) import
    WHERE import.rownum = 1;
    

    I copied the table via the COPY_TABLE function and dropped the very_big_column, and afterwards the CREATE TABLE ... AS SELECT ran in roughly 2 seconds instead of up to 10 minutes.
    The very_big_column holds VARCHARs with lengths up to 20K, but mostly under 1K.
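
    For reference, the LIMIT-analytic form of the query above would look like this (same hypothetical names; note the wide column still has to be read and written, so this sketch may hit the same bottleneck):

    ```sql
    -- The wide column stays in the SELECT list but out of the PARTITION BY,
    -- so it is carried along without ever being compared.
    CREATE TABLE IF NOT EXISTS tableName_dedup AS
    SELECT small_column1, small_column2, very_big_column
    FROM tableName_import
    LIMIT 1 OVER (PARTITION BY small_column1, small_column2 ORDER BY 1);
    ```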
