Vertica constraint enforcement and deduplication
We have a large fact table on which a primary key constraint is enforced. However, we are observing issues with load performance. Since Vertica also suggests disabling constraint enforcement on large fact tables, we are planning to disable it on this table.
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AdministratorsGuide/Constraints/ConstraintEnforcement/ConstraintPerformanceImpact.htm
I know I can use analyze_constraints() to get the keys that violate the constraint, but I don't know how to use those keys for deduplication. Is there any example or standard procedure to follow?
The fact table can have billions of rows, so how expensive could analyze_constraints() be?
Any help would be appreciated!
Best Answer
[Deleted User] Administrator
You can automate the process of running constraint validation and deleting duplicate records. I would not initiate a purge on a table, or on a specific partition of a table, unless it has at least 5% deleted records; running frequent purge operations on tables with few deleted records can cause unnecessary stress on system resources. In Vertica, the automatic Tuple Mover mergeout service will purge deleted records from ROS containers that qualify for mergeout. The Tuple Mover service will also initiate a purge on ROS containers that have more than 20% deleted records; the percentage used by mergeout is configurable via the config parameter PurgeMergeoutPercent (default value of 20). Setting this value very low will lead to aggressive mergeout operations, which again require system resources. You may find these articles useful: https://www.vertica.com/kb/Deletes-in-Vertica-The-FAQs/Content/FAQs/Deletes-in-Vertica-The-FAQs.htm and https://www.vertica.com/kb/Replay-Deletes/Content/BestPractices/Replay-Deletes.htm
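For reference, a small sketch of how those checks and a manual purge could be run; the table name public.fact is a hypothetical stand-in for the real fact table:

-- Current mergeout purge threshold (default 20).
SELECT parameter_name, current_value
FROM v_monitor.configuration_parameters
WHERE parameter_name = 'PurgeMergeoutPercent';

-- Rough share of deleted rows per projection of the (hypothetical) fact table;
-- projections are usually named after their anchor table.
SELECT schema_name, projection_name,
       SUM(deleted_row_count) AS deleted_rows,
       SUM(total_row_count)   AS total_rows
FROM v_monitor.storage_containers
WHERE schema_name = 'public' AND projection_name ILIKE 'fact%'
GROUP BY schema_name, projection_name;

-- Only if a table (or partition) is well above the ~5% guideline:
SELECT PURGE_TABLE('public.fact');
-- or, for a single partition key:
-- SELECT PURGE_PARTITION('public.fact', '20190101');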
Answers
There are many options to prevent loading duplicate records, but they come with additional cost. If you decide to load into the fact table directly and only periodically check for duplicates, then analyze_constraints() should work. For deleting duplicate records you can make use of the epoch column in Vertica: every record has an entry in the epoch column that stores the epoch in which that record was committed. This works as long as the duplicate records were not loaded as part of the same commit epoch.
newdbadmin=> create table test(c1 int primary key);
CREATE TABLE
newdbadmin=> copy test from stdin direct;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
newdbadmin=> copy test from stdin direct;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
newdbadmin=> select analyze_constraints('test','c1');
Schema Name | Table Name | Column Names | Constraint Name | Constraint Type | Column Values
-------------+------------+--------------+-----------------+-----------------+---------------
 public      | test       | c1           | C_PRIMARY       | PRIMARY         | ('2')
(1 row)
newdbadmin=> select c1, epoch from test where c1 = 2;
c1 | epoch
----+---------
2 | 2596983
2 | 2596984
(2 rows)
newdbadmin=> delete from test where epoch = 2596984 and c1 = 2; commit;
 OUTPUT
--------
      1
(1 row)
COMMIT
newdbadmin=> select * from test;
 c1
----
  5
  2
  3
  1
  4
  6
(6 rows)
@skamat This is really helpful, and interesting to know about the epoch column. I'll try it out.
Is there any way we can automate this deduplication and run it during off-peak hours?
And after deleting a few rows, should I also purge them by running SELECT PURGE()?
What do you suggest?
@skamat Can you help me with automating the deduplication? How can I treat the analyze_constraints() output as normal query output and use it to delete rows in the main table?
Can you help me create a query that uses the analyze_constraints() output to delete rows?
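Since analyze_constraints() is a meta-function, its result set cannot be used directly as a subquery in a DELETE. One workaround, sketched here with a hypothetical fact table public.fact keyed on an id column, is to find the duplicate keys with an ordinary GROUP BY / HAVING query instead and stage them in a temp table:

-- Stage the duplicated keys (hypothetical table/column names).
CREATE LOCAL TEMPORARY TABLE dup_keys ON COMMIT PRESERVE ROWS AS
SELECT id
FROM public.fact
GROUP BY id
HAVING COUNT(*) > 1;

-- Inspect the duplicates together with their commit epochs before deleting.
SELECT f.id, f.epoch
FROM public.fact f
JOIN dup_keys d ON d.id = f.id
ORDER BY f.id, f.epoch;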
@rajatpaliwal86 - If you have a different epoch for every duplicate row, you can run a DELETE statement to remove the duplicates, keeping only the record with the latest epoch...
Example:
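A minimal sketch of such a DELETE against the test table from above, keeping the row with the latest epoch for each key (assuming the duplicates were committed in different epochs, and that a multicolumn NOT IN subquery is accepted here; an equivalent correlated MAX(epoch) comparison can be used otherwise):

-- Keep, for every key, only the row with the latest epoch; older duplicates
-- are deleted. This only works when the duplicates sit in different epochs.
DELETE FROM test
WHERE (c1, epoch) NOT IN (
    SELECT c1, MAX(epoch)
    FROM test
    GROUP BY c1
);
COMMIT;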
@Jim_Knicely very helpful!!
Since the fact table could be very large, i.e. it can have several billion rows, I might do deduplication over one day's worth of data, possibly during off-peak hours at night. Should the DELETE be any concern on a large fact table? What do you suggest?
Most likely, duplicate records will not get inserted, but there are some edge cases in which data might get inserted into Vertica twice. The fact table is partitioned by day id.
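If the cleanup is run per day, the same epoch-based pattern can be restricted to a single day's partition so that the DELETE (and any later purge) only touches one partition; a sketch with hypothetical names public.fact, id, and day_id:

-- Deduplicate a single day's partition (hypothetical names day_id / id).
DELETE FROM public.fact
WHERE day_id = 20190101
  AND (id, epoch) NOT IN (
      SELECT id, MAX(epoch)
      FROM public.fact
      WHERE day_id = 20190101
      GROUP BY id
  );
COMMIT;

-- Optionally purge just that partition afterwards if the deleted share is large:
-- SELECT PURGE_PARTITION('public.fact', '20190101');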