How to deal with "DDL statement interfered with this statement"

kfruchtman (Vertica Customer)

Hi Guys!
I need help with a bulk load. I have a Python script pumping data from an outside server into my Vertica DB every minute. The script creates a CSV file and loads it into a temp table, then runs a MERGE between the two tables (temp and main). This way duplicates are not inserted (MERGE inserts only rows that don't already exist in the main table).
A year ago this worked great, but the main table now holds over 3 billion rows. With so many rows arriving each minute, one MERGE gets interrupted by another MERGE, the inserts don't complete fully, and I get many, many failed queries with the "DDL statement interfered with this statement" error.
I am looking for a workaround. If I don't let a couple of MERGE commands run in parallel, the data will be uploaded too slowly.

If I use a COPY command instead, will the whole file be thrown out when it contains a duplicate row, or will I just get errors? How do I check that?
I need to insert thousands (sometimes millions) of rows all day long into my main table, with a way to make sure duplicates are not created. It should run relatively fast, without the above error from one command locking the main table.
I don't know if it helps, but my main table has a projection with a primary key constraint.

Is there a smart way to do this?
Any help will be greatly appreciated.

Kind regards,
Keren

Answers

  • mosheg (Vertica Employee, Administrator)

    Q - If I do "copy" command will the file be thrown out in a duplicate row case or just get errors?
    A - The whole COPY statement will fail.

    Please consider the following:
    1. Use sequences for primary key columns to guarantee uniqueness; this avoids the resource overhead that enforced primary key constraints can incur.
    2. An optimized MERGE is much faster than a non-optimized MERGE statement.
    See: https://www.vertica.com/docs/11.1.x/HTML/Content/Authoring/AdministratorsGuide/Tables/MergeTables/OptimizedVsNonOptimizedMerge.htm
    3. To make COPY statements load faster, add nodes to the cluster.
    4. To avoid DELETEs, UPDATEs and MERGEs, you can use a Live Aggregate Projection (LAP), in this case a Top-K projection that keeps the latest version of the data.
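On the concurrency side, the "DDL statement interfered" errors come from MERGE statements colliding, so another common workaround is to funnel every batch through a single writer so only one MERGE runs at a time. A minimal Python sketch of that single-writer pattern (the dict stands in for the main table; this is the shape of the idea, not a Vertica client call):

```python
import queue
import threading

def run_single_writer(batches):
    """Serialize all merges through one writer thread so they never collide."""
    work = queue.Queue()
    main_table = {}          # stands in for the main table: key -> row
    done = object()          # sentinel telling the writer to stop

    def writer():
        while True:
            batch = work.get()
            if batch is done:
                break
            # "MERGE" semantics: insert only keys not already present
            for key, row in batch:
                main_table.setdefault(key, row)

    t = threading.Thread(target=writer)
    t.start()
    for b in batches:
        work.put(b)          # producers only enqueue; they never touch the table
    work.put(done)
    t.join()
    return main_table
```

In the real pipeline the writer thread would issue the actual MERGE statement; producers just hand it batches, so no two MERGEs ever run against the main table at once.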

    create table stage_table  ( id int not null constraint stg_pk primary key enabled, val varchar(10), ts timestamp);
    copy stage_table from STDIN abort on error;
    1|one|2022-02-01
    2|two|2022-02-02
    3|three|2022-02-03
    \.
    
    -- When trying to insert duplicate rows while the table has an enabled primary key constraint,
    -- the following COPY hits a constraint violation and the whole statement fails:
    copy stage_table from STDIN abort on error;
    2|two|2022-02-02
    4|four|2022-02-04
    \.
    vsql:duplicate_check.sql:18: ERROR 6745:  Duplicate key values: 'id=2' -- violates constraint 'DEMO_DUP.stage_table.stg_pk'
    
    -- We can disable the constraint:
    ALTER TABLE stage_table ALTER CONSTRAINT stg_pk DISABLED;
    
    -- And then the COPY statement will also load duplicates:
    copy stage_table from STDIN abort on error;
    2|two|2022-02-02
    4|four|2022-02-04
    \.
    
    -- Check ID record number for all records:
    SELECT id, val, ts, ROW_NUMBER()  OVER(PARTITION BY id) as RN FROM stage_table ORDER BY 1;
     id |  val  |         ts          | RN
    ----+-------+---------------------+----
      1 | one   | 2022-02-01 00:00:00 |  1
      2 | two   | 2022-02-02 00:00:00 |  1
      2 | two   | 2022-02-02 00:00:00 |  2
      3 | three | 2022-02-03 00:00:00 |  1
      4 | four  | 2022-02-04 00:00:00 |  1
    (5 rows)
    
    -- Fetch unique records based upon the latest timestamp column:
    SELECT * FROM ( SELECT id, val, ts, ROW_NUMBER()  OVER(PARTITION BY id order by ts desc) RN FROM stage_table ) AS tb WHERE tb.RN = 1 ORDER BY 1;
     id |  val  |         ts          | RN
    ----+-------+---------------------+----
      1 | one   | 2022-02-01 00:00:00 |  1
      2 | two   | 2022-02-02 00:00:00 |  1
      3 | three | 2022-02-03 00:00:00 |  1
      4 | four  | 2022-02-04 00:00:00 |  1
    (4 rows)
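The same latest-row-per-key dedup can also be done client side before loading, which shrinks what the MERGE has to reconcile. A hedged Python sketch of the logic of the query above (the (id, val, ts) layout is assumed to match the example table):

```python
def latest_per_key(rows):
    """Keep only the newest row per id, mirroring
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) = 1."""
    best = {}                     # id -> newest (id, val, ts) seen so far
    for id_, val, ts in rows:
        if id_ not in best or ts > best[id_][2]:
            best[id_] = (id_, val, ts)
    return sorted(best.values())  # sorted by id, like ORDER BY 1
```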
    
    -- Which primary keys have constraint violation issues on a particular table?
    SELECT ANALYZE_CONSTRAINTS ('stage_table', 'id');
     Schema Name | Table Name  | Column Names | Constraint Name | Constraint Type | Column Values
    -------------+-------------+--------------+-----------------+-----------------+---------------
     DEMO_DUP    | stage_table | id           | stg_pk          | PRIMARY         | ('2')
    (1 row)
    
    -- Which keys have duplicate records?
    SELECT id, count(*) cnt FROM stage_table group by id  having count(*) > 1;
     id | cnt
    ----+-----
      2 |   2
    (1 row)
    
    -- We can create a Top-K projection (a kind of LAP) which always exposes the latest staging table data
    CREATE PROJECTION stage_table_topk ( id, ts, val) as SELECT  id, ts, val FROM stage_table LIMIT 1 OVER (PARTITION BY  id ORDER BY ts DESC);
    SELECT REFRESH('stage_table');
    
    SELECT * FROM stage_table ORDER BY 1;
     id |  val  |         ts
    ----+-------+---------------------
      1 | one   | 2022-02-01 00:00:00
      2 | two   | 2022-02-02 00:00:00
      2 | two   | 2022-02-02 00:00:00
      3 | three | 2022-02-03 00:00:00
      4 | four  | 2022-02-04 00:00:00
    (5 rows)
    
    SELECT * FROM stage_table_topk ORDER BY 1;
     id |         ts          |  val
    ----+---------------------+-------
      1 | 2022-02-01 00:00:00 | one
      2 | 2022-02-02 00:00:00 | two
      3 | 2022-02-03 00:00:00 | three
      4 | 2022-02-04 00:00:00 | four
    (4 rows)
    
