ignoring duplicates during COPY


I have a table that is continuously loaded with new data via the COPY command, and it receives almost a billion rows a day (with data retention up to 90 days).

There is a chance that duplicate records are sent to the COPY command, resulting in duplicates being loaded into the table.

The table includes a checksum column that uniquely identifies each row's content, so duplicate rows share the same checksum. I have observed that for such duplicates the EPOCH column values differ, indicating they were loaded by different COPY requests.
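To illustrate, this is roughly how the duplicates show up. The table name below is a placeholder; checksum and epoch are the columns described above:

-- Placeholder table name; duplicates share a checksum but carry
-- different epoch values because they came from different COPY batches.
SELECT checksum,
       COUNT(*)   AS copies,
       MIN(epoch) AS first_epoch,
       MAX(epoch) AS last_epoch
FROM   my_events
GROUP  BY checksum
HAVING COUNT(*) > 1;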

How can I ignore duplicates during data load?

The application that executes the COPY command reads data from an AWS Kinesis stream, which doesn't guarantee at-most-once delivery. There is also a possibility that the original data source contains duplicates.

I came across the ANALYZE_CONSTRAINTS function in the documentation, which suggests running COPY with NO COMMIT and then executing ANALYZE_CONSTRAINTS to take the necessary action before committing. However, given the table's size and the frequency of batch loads in my case, ANALYZE_CONSTRAINTS runs for an excessively long time.
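For reference, this is roughly the documented pattern I tried. The table name and file path are placeholders, and it assumes checksum is declared as a UNIQUE or PRIMARY KEY constraint so ANALYZE_CONSTRAINTS has something to validate:

-- Load the batch without committing, then validate constraints on the target table.
COPY my_events FROM '/data/kinesis_batch_0001.csv' DELIMITER ',' NO COMMIT;

-- Returns one row per constraint violation (duplicate checksum values, in this case).
SELECT ANALYZE_CONSTRAINTS('my_events');

-- If no violations are reported: COMMIT;
-- otherwise: ROLLBACK; and reload the batch after removing the duplicates.

It is this ANALYZE_CONSTRAINTS step that becomes the bottleneck at my volumes.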
