Ignoring duplicates during COPY
I have a table that is continuously loaded with new data via the COPY command, and it receives almost a billion rows a day (with data retention up to 90 days).
There is a chance that duplicate records are sent to the COPY command, resulting in duplicates being loaded into the table.
The table includes a checksum column whose value is unique for each logical row. For duplicated rows I have observed that the EPOCH column values differ, indicating they were loaded by different COPY requests.
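For illustration, this is roughly how I spot the duplicates today (the table name events is just a placeholder):

    -- checksums that appear more than once, i.e. duplicated rows
    SELECT checksum, COUNT(*) AS copies
    FROM events
    GROUP BY checksum
    HAVING COUNT(*) > 1;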
How can I ignore duplicates during data load?
The application that executes the COPY command reads data from an AWS Kinesis stream, which does not guarantee at-most-once delivery; there is also a possibility that the original data source itself contains duplicates.
I came across the ANALYZE_CONSTRAINTS function in the documentation, which suggests running COPY with NO COMMIT and then executing ANALYZE_CONSTRAINTS to take the necessary action. However, given the table's size and the frequent batch loads in my case, ANALYZE_CONSTRAINTS runs for an excessively long time.
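For reference, this is roughly the workflow the documentation describes (the table name events and the file path are placeholders for my actual load):

    -- load a batch without committing, then check for constraint violations
    COPY events FROM '/data/batch.csv' DELIMITER ',' NO COMMIT;
    SELECT ANALYZE_CONSTRAINTS('public.events');
    -- ROLLBACK if violations are reported, otherwise COMMIT

With a table this large and batches arriving continuously, the ANALYZE_CONSTRAINTS step is the bottleneck.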