Ignoring duplicates during COPY

I have a table that is continuously loaded with new data via the COPY command, and it receives almost a billion rows a day (with data retention up to 90 days).

There is a chance that duplicate records are sent to the COPY command, resulting in duplicates being loaded into the table.

The table includes a column called checksum, which is unique for each row. I have observed that for duplicate rows, the EPOCH column values differ (indicating they were loaded by different COPY requests).

How can I ignore duplicates during data load?

The application that executes the COPY command reads data from an AWS Kinesis stream, which doesn't guarantee at-most-once delivery. There is also the possibility that the original data source itself contains duplicates.

I came across the ANALYZE_CONSTRAINTS function in the documentation, which suggests running COPY with NO COMMIT and then executing ANALYZE_CONSTRAINTS to detect violations before committing. However, given the large table size and frequent batch loads in my case, ANALYZE_CONSTRAINTS runs for an excessively long time.
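
For reference, the pattern from the documentation looks roughly like this (a minimal sketch: the table name events, the file path, and the delimiter are placeholders, and it assumes a UNIQUE constraint is declared on checksum, since ANALYZE_CONSTRAINTS only reports violations of declared constraints):

    -- Load without committing, then check declared constraints.
    COPY events FROM '/path/to/batch.dat' DELIMITER '|' NO COMMIT;
    SELECT ANALYZE_CONSTRAINTS('events');

    -- Empty result: no violations, so keep the batch.
    COMMIT;
    -- Violations reported: discard the whole batch instead.
    -- ROLLBACK;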

Answers

  • Any comments?

  • Ignoring duplicate rows during the load itself would be quite expensive.
    For Vertica, the best approach is to load into a staging table and then
    insert into your target table only those rows that are not duplicates;
    see the sketch below.
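
A minimal sketch of that staging pattern, assuming a target table events with placeholder columns checksum, epoch, and payload (all names here are illustrative, not from your schema):

    -- One-time setup: a staging table with the same definition as the target.
    CREATE TABLE events_stage LIKE events INCLUDING PROJECTIONS;

    -- Per batch: load into the staging table first.
    COPY events_stage FROM '/path/to/batch.dat' DELIMITER '|' DIRECT;

    -- Keep one row per checksum within the batch, and skip checksums
    -- that already exist in the target table.
    INSERT INTO events (checksum, epoch, payload)
    SELECT checksum, epoch, payload
    FROM (
        SELECT checksum, epoch, payload,
               ROW_NUMBER() OVER (PARTITION BY checksum ORDER BY epoch) AS rn
        FROM events_stage
    ) deduped
    WHERE rn = 1
      AND NOT EXISTS (SELECT 1 FROM events e
                      WHERE e.checksum = deduped.checksum);

    COMMIT;
    TRUNCATE TABLE events_stage;

The NOT EXISTS probe against the full target table is the expensive part at this scale; if duplicates can only arrive within a bounded time window, restricting that subquery to recent data (e.g. by epoch) keeps the per-batch cost down.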
