Vertica constraint enforcement and deduplication

rajatpaliwal86 Vertica Customer

We have a large fact table with a primary key constraint enabled. However, we are observing issues with load performance. Since Vertica suggests keeping this constraint disabled on large fact tables, we are planning to disable it on the table.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Constraints/ConstraintEnforcement/ConstraintPerformanceImpact.htm
I know I can use analyze_constraints() to get the keys that violate the constraint, but I don't know how to use those keys for deduplication. Is there an example or standard procedure to follow?
The fact table can have billions of rows, so how expensive could analyze_constraints be?
Any help would be appreciated!

Best Answer

  • [Deleted User] Administrator
    edited May 2020 Answer ✓

    You can automate a process that runs constraint validation and deletes the duplicate records. I would not initiate a purge on a table, or on a specific partition of a table, unless it has at least 5% deleted records (a rough sketch of that check is below); running frequent purge operations on tables with few deleted records can cause unnecessary stress on system resources. In Vertica, the automatic Tuple Mover mergeout service purges deleted records from ROS containers that qualify for mergeout. The Tuple Mover will also initiate a purge on ROS containers that have more than 20% deleted records; the percentage used by mergeout is configurable via the config parameter PurgeMergeoutPercent (default value 20). Setting this value very low will lead to aggressive mergeout operations, which again require system resources. You may find these articles useful: https://www.vertica.com/kb/Deletes-in-Vertica-The-FAQs/Content/FAQs/Deletes-in-Vertica-The-FAQs.htm and https://www.vertica.com/kb/Replay-Deletes/Content/BestPractices/Replay-Deletes.htm
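
    A minimal sketch of that check-then-purge step (the schema name, table name, and partition key below are placeholders, not from this thread):

    -- Fraction of deleted rows across the table's storage containers; per the
    -- guidance above, only purge when this exceeds roughly 5%.
    SELECT 100 * SUM(deleted_row_count) / NULLIF(SUM(total_row_count), 0) AS deleted_pct
      FROM v_monitor.storage_containers
     WHERE schema_name = 'store'
       AND projection_name ILIKE 'fact_sales%';

    -- If the threshold is exceeded, purge just the affected partition instead of the whole table.
    SELECT PURGE_PARTITION('store.fact_sales', '2020-05-18');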

Answers

  • [Deleted User] Administrator

    There are several options to prevent loading duplicate records:

    • Copy into a staging table and then merge into the fact table.
    • Copy into the fact table using the "NO COMMIT" option for COPY, then run analyze_constraints before committing.
      Both options above come with additional cost (a rough sketch of each is at the end of this post). If you decide to load into the fact table directly but check for duplicates periodically, then analyze_constraints should work. For deleting the duplicate records you can make use of the epoch column in Vertica. Each record in Vertica has an entry in the epoch column that stores the epoch in which the record was committed. This works as long as the duplicate records were not loaded as part of the same commit epoch.

    newdbadmin=> create table test(c1 int primary key);
    CREATE TABLE
    newdbadmin=> copy test from stdin direct;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.

    1
    2
    3
    \.

    newdbadmin=> copy test from stdin direct;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.

    2
    4
    5
    6
    \.

    newdbadmin=> select analyze_constraints('test','c1');
     Schema Name | Table Name | Column Names | Constraint Name | Constraint Type | Column Values
    -------------+------------+--------------+-----------------+-----------------+---------------
     public      | test       | c1           | C_PRIMARY       | PRIMARY         | ('2')
    (1 row)

    newdbadmin=> select c1, epoch from test where c1 = 2;
     c1 |  epoch
    ----+---------
      2 | 2596983
      2 | 2596984
    (2 rows)

    newdbadmin=> delete from test where epoch = 2596984 and c1 = 2; commit;
     OUTPUT
    --------
          1
    (1 row)
    COMMIT
    newdbadmin=> select * from test;
     c1
    ----
      5
      2
      3
      1
      4
      6
    (6 rows)
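
    For completeness, a rough sketch of the two options listed above (the fact table, staging table, columns, and file path are made-up names for illustration):

    -- Option 1: load into a staging table, then MERGE so existing keys are not inserted twice.
    CREATE TABLE fact_stage LIKE fact INCLUDING PROJECTIONS;
    COPY fact_stage FROM '/data/batch.csv' DELIMITER ',' DIRECT;
    MERGE INTO fact f USING fact_stage s ON f.id = s.id
    WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount);
    TRUNCATE TABLE fact_stage;

    -- Option 2: load with NO COMMIT, validate, then commit or roll back.
    COPY fact FROM '/data/batch.csv' DELIMITER ',' DIRECT NO COMMIT;
    SELECT ANALYZE_CONSTRAINTS('fact');  -- lists violating keys, if any
    -- COMMIT if the result is empty, otherwise ROLLBACK.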

  • rajatpaliwal86 Vertica Customer

    @skamat This is really helpful, and it's interesting to learn about the epoch column. I'll try it out.
    Is there any way we can automate this deduplication so that it runs during off-peak hours?
    And after deleting a few rows, should I also purge them by running SELECT PURGE()?
    What do you suggest?

  • rajatpaliwal86 Vertica Customer

    @skamat Can you help me automate the deduplication? How can I treat the analyze_constraints output as normal query output and use it to delete rows in the main table?
    Can you help me create a query that uses the analyze_constraints output to delete rows?

  • Jim_Knicely Administrator
    edited May 2020

    @rajatpaliwal86 - If you have a different epoch for every dup row, you can run a DELETE statement to remove the duplicates, keeping only the record having the latest epoch...

    Example:

    dbadmin=> \d dups
                                       List of Fields by Tables
     Schema | Table | Column |    Type     | Size | Default | Not Null | Primary Key | Foreign Key
    --------+-------+--------+-------------+------+---------+----------+-------------+-------------
     public | dups  | c1     | int         |    8 |         | t        | t           |
     public | dups  | c2     | varchar(10) |   10 |         | t        | t           |
    (2 rows)
    
    dbadmin=> SELECT epoch, * FROM dups ORDER BY c1, c2;
     epoch | c1 |  c2
    -------+----+-------
        84 |  1 | DATA1
        79 |  1 | DATA1
        80 |  2 | DATA2
        79 |  2 | DATA2
        83 |  3 | DATA3
        82 |  3 | DATA3
        81 |  3 | DATA3
        79 |  3 | DATA3
    (8 rows)
    
    dbadmin=> DELETE /*+ DIRECT */ FROM dups WHERE (epoch, c1, c2) NOT IN (SELECT epoch, c1, c2 FROM dups LIMIT 1 OVER (PARTITION BY c1, c2 ORDER BY epoch DESC));
     OUTPUT
    --------
          5
    (1 row)
    
    dbadmin=> SELECT epoch, * FROM dups ORDER BY c1, c2;
     epoch | c1 |  c2
    -------+----+-------
        84 |  1 | DATA1
        80 |  2 | DATA2
        83 |  3 | DATA3
    (3 rows)
    
    dbadmin=> COMMIT;
    COMMIT
    
  • rajatpaliwal86 Vertica Customer

    @Jim_Knicely said:
    @rajatpaliwal86 - If you have a different epoch for every dup row, you can run a DELETE statement to remove the duplicates, keeping only the record having the latest epoch...

    @Jim_Knicely very helpful!!
    Since the fact table could be very large, i.e. several billion rows, I might run the deduplication over one day's worth of data, possibly during off-peak hours at night. Should the delete be a concern on a large fact table? What do you suggest?
    Most likely duplicate records will not get inserted, but there are some edge cases in which data might get inserted into Vertica twice. The fact table is partitioned by day id.
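
    For a day-partitioned table, I'm thinking of something like the sketch below, which just restricts the DELETE above to a single day's partition (the table name "fact", key column "id", and partition column "day_id" are placeholders, not the real schema):

    -- Keep only the latest epoch per key within one day's partition; placeholder names.
    DELETE /*+ DIRECT */ FROM fact
     WHERE day_id = 20200518
       AND (epoch, id) NOT IN (SELECT epoch, id FROM fact
                                WHERE day_id = 20200518
                                LIMIT 1 OVER (PARTITION BY id ORDER BY epoch DESC));
    COMMIT;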
