Vertica constraint enforcement and deduplication

rajatpaliwal86 Vertica Customer

We have a large fact table with a primary key constraint enabled. However, we are observing issues with load performance. Since Vertica suggests keeping this constraint disabled on large fact tables, we are planning to disable it on the table.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Constraints/ConstraintEnforcement/ConstraintPerformanceImpact.htm
I know I can use analyze_constraints() to get the keys that violate the constraint, but I don't know how to use those keys for deduplication. Is there an example or standard procedure to follow?
The fact table can have billions of rows, so how expensive could analyze_constraints be?
Any help would be appreciated!

Best Answer

  • [Deleted User] Administrator
    edited May 2020 Answer ✓

    You can automate a process that runs constraint validation and deletes the duplicate records. I would not initiate a purge on a table, or on a specific partition of a table, unless it has at least 5% deleted records (a rough sketch of that check is below); running frequent purge operations on tables with few deleted records can cause unnecessary stress on system resources. In Vertica, the automatic Tuple Mover mergeout service purges deleted records from ROS containers that qualify for mergeout. The Tuple Mover will also initiate a purge on ROS containers that have more than 20% deleted records; the percentage used by mergeout is configurable via the config parameter PurgeMergeoutPercent (default value 20). Setting this value very low will lead to aggressive mergeout operations, which again require system resources. You may find these articles useful: https://www.vertica.com/kb/Deletes-in-Vertica-The-FAQs/Content/FAQs/Deletes-in-Vertica-The-FAQs.htm and https://www.vertica.com/kb/Replay-Deletes/Content/BestPractices/Replay-Deletes.htm
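
    A minimal sketch of that check-then-purge step (the schema name, table name, and partition key below are placeholders, not from this thread):

    -- Fraction of deleted rows across the table's storage containers; per the
    -- guidance above, only purge when this exceeds roughly 5%.
    SELECT 100 * SUM(deleted_row_count) / NULLIF(SUM(total_row_count), 0) AS deleted_pct
      FROM v_monitor.storage_containers
     WHERE schema_name = 'store'
       AND projection_name ILIKE 'fact_sales%';

    -- If the threshold is exceeded, purge just the affected partition instead of the whole table.
    SELECT PURGE_PARTITION('store.fact_sales', '2020-05-18');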

Answers

  • [Deleted User] Administrator

    There are several options to prevent loading duplicate records:

    • Copy into a staging table and then merge into the fact table.
    • Copy into the fact table using the "NO COMMIT" option for COPY, then run analyze_constraints before committing.
      Both options above come with additional cost (a rough sketch of each is at the end of this post). If you decide to load into the fact table directly but check for duplicates periodically, then analyze_constraints should work. For deleting the duplicate records you can make use of the epoch column in Vertica. Each record in Vertica has an entry in the epoch column that stores the epoch in which the record was committed. This works as long as the duplicate records were not loaded as part of the same commit epoch.

    newdbadmin=> create table test(c1 int primary key);
    CREATE TABLE
    newdbadmin=> copy test from stdin direct;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.

    1
    2
    3
    \.

    newdbadmin=> copy test from stdin direct;
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.

    2
    4
    5
    6
    \.

    newdbadmin=> select analyze_constraints('test','c1');
     Schema Name | Table Name | Column Names | Constraint Name | Constraint Type | Column Values
    -------------+------------+--------------+-----------------+-----------------+---------------
     public      | test       | c1           | C_PRIMARY       | PRIMARY         | ('2')
    (1 row)

    newdbadmin=> select c1, epoch from test where c1 = 2;
     c1 |  epoch
    ----+---------
      2 | 2596983
      2 | 2596984
    (2 rows)

    newdbadmin=> delete from test where epoch = 2596984 and c1 = 2; commit;
     OUTPUT
    --------
          1
    (1 row)
    COMMIT
    newdbadmin=> select * from test;
     c1
    ----
      5
      2
      3
      1
      4
      6
    (6 rows)
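
    For completeness, a rough sketch of the two options listed above (the fact table, staging table, columns, and file path are made-up names for illustration):

    -- Option 1: load into a staging table, then MERGE so existing keys are not inserted twice.
    CREATE TABLE fact_stage LIKE fact INCLUDING PROJECTIONS;
    COPY fact_stage FROM '/data/batch.csv' DELIMITER ',' DIRECT;
    MERGE INTO fact f USING fact_stage s ON f.id = s.id
    WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount);
    TRUNCATE TABLE fact_stage;

    -- Option 2: load with NO COMMIT, validate, then commit or roll back.
    COPY fact FROM '/data/batch.csv' DELIMITER ',' DIRECT NO COMMIT;
    SELECT ANALYZE_CONSTRAINTS('fact');  -- lists violating keys, if any
    -- COMMIT if the result is empty, otherwise ROLLBACK.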

  • rajatpaliwal86 Vertica Customer

    @skamat This is really helpful, and it's interesting to learn about the epoch column. I'll try it out.
    Is there any way we can automate this deduplication so that it runs during off-peak hours?
    And after deleting a few rows, should I also purge them by running SELECT PURGE()?
    What do you suggest?

  • rajatpaliwal86 Vertica Customer

    @skamat Can you help me automate the deduplication? How can I treat the analyze_constraints output as normal query output and use it to delete rows in the main table?
    Can you help me create a query that uses the analyze_constraints output to delete rows?

  • Jim_Knicely Administrator
    edited May 2020

    @rajatpaliwal86 - If you have a different epoch for every dup row, you can run a DELETE statement to remove the duplicates, keeping only the record having the latest epoch...

    Example:

    dbadmin=> \d dups
                                       List of Fields by Tables
     Schema | Table | Column |    Type     | Size | Default | Not Null | Primary Key | Foreign Key
    --------+-------+--------+-------------+------+---------+----------+-------------+-------------
     public | dups  | c1     | int         |    8 |         | t        | t           |
     public | dups  | c2     | varchar(10) |   10 |         | t        | t           |
    (2 rows)
    
    dbadmin=> SELECT epoch, * FROM dups ORDER BY c1, c2;
     epoch | c1 |  c2
    -------+----+-------
        84 |  1 | DATA1
        79 |  1 | DATA1
        80 |  2 | DATA2
        79 |  2 | DATA2
        83 |  3 | DATA3
        82 |  3 | DATA3
        81 |  3 | DATA3
        79 |  3 | DATA3
    (8 rows)
    
    dbadmin=> DELETE /*+ DIRECT */ FROM dups WHERE (epoch, c1, c2) NOT IN (SELECT epoch, c1, c2 FROM dups LIMIT 1 OVER (PARTITION BY c1, c2 ORDER BY epoch DESC));
     OUTPUT
    --------
          5
    (1 row)
    
    dbadmin=> SELECT epoch, * FROM dups ORDER BY c1, c2;
     epoch | c1 |  c2
    -------+----+-------
        84 |  1 | DATA1
        80 |  2 | DATA2
        83 |  3 | DATA3
    (3 rows)
    
    dbadmin=> COMMIT;
    COMMIT
    
  • rajatpaliwal86 Vertica Customer

    @Jim_Knicely said:
    @rajatpaliwal86 - If you have a different epoch for every dup row, you can run a DELETE statement to remove the duplicates, keeping only the record having the latest epoch...

    @Jim_Knicely very helpful!!
    Since the fact table could be very large, i.e. several billion rows, I might run the deduplication over one day's worth of data, possibly during off-peak hours at night. Should the delete be a concern on a large fact table? What do you suggest?
    Most likely duplicate records will not get inserted, but there are some edge cases in which data might get inserted into Vertica twice. The fact table is partitioned by day id.
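
    For a day-partitioned table, I'm thinking of something like the sketch below, which just restricts the DELETE above to a single day's partition (the table name "fact", key column "id", and partition column "day_id" are placeholders, not the real schema):

    -- Keep only the latest epoch per key within one day's partition; placeholder names.
    DELETE /*+ DIRECT */ FROM fact
     WHERE day_id = 20200518
       AND (epoch, id) NOT IN (SELECT epoch, id FROM fact
                                WHERE day_id = 20200518
                                LIMIT 1 OVER (PARTITION BY id ORDER BY epoch DESC));
    COMMIT;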
