Ignoring duplicates during COPY
I have a table that is continuously loaded with new data via the COPY command, and it receives almost a billion rows a day (with data retention up to 90 days).
There is a chance that duplicate records are sent to the COPY command, resulting in duplicates being loaded into the table.
The table includes a checksum column that is unique for each row. For duplicated rows, I have observed that the EPOCH column values differ, indicating they arrived in different COPY requests.
How can I ignore duplicates during data load?
The application that executes the COPY command reads data from an AWS Kinesis stream, which doesn't guarantee at-most-once delivery. There is also a possibility that the original data source itself contains duplicates.
I came across the ANALYZE_CONSTRAINTS function in the documentation, which suggests running COPY with NO COMMIT, then executing ANALYZE_CONSTRAINTS to take necessary actions. However, given the large table size and frequent batch loads in my case, the ANALYZE_CONSTRAINTS function runs for an excessively long time.
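For reference, the documented flow I tried looks roughly like this (table name, file path, and the UNIQUE constraint on checksum are placeholders for my setup); this is the approach that runs too long at my batch sizes:

```sql
-- Load without committing, then check constraints before deciding.
COPY my_table FROM '/path/to/batch.csv' DELIMITER ',' NO COMMIT;

-- Reports rows that violate declared constraints (e.g. UNIQUE on checksum).
SELECT ANALYZE_CONSTRAINTS('my_table');

-- If violations are reported: ROLLBACK; otherwise: COMMIT;
```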
Answers
Any comments?
Ignoring duplicate rows during the load itself would be quite expensive.
In Vertica, the best approach would be to COPY into a staging table first, then insert into your target table only those rows that are not already present there (and deduplicate within the batch at the same time).
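A minimal sketch of that staging pattern, assuming hypothetical tables events (target) and events_stage (staging, same columns) and that checksum is the dedup key; column names are illustrative:

```sql
-- Start each batch with an empty staging table.
TRUNCATE TABLE events_stage;

-- Fast bulk load; duplicates are allowed here.
COPY events_stage FROM '/path/to/batch.csv' DELIMITER ',' DIRECT;

-- Keep one row per checksum within the batch (ROW_NUMBER),
-- and skip checksums already present in the target (NOT EXISTS).
INSERT /*+ DIRECT */ INTO events (checksum, epoch, payload)
SELECT s.checksum, s.epoch, s.payload
FROM (
    SELECT checksum, epoch, payload,
           ROW_NUMBER() OVER (PARTITION BY checksum ORDER BY epoch) AS rn
    FROM events_stage
) s
WHERE s.rn = 1
  AND NOT EXISTS (SELECT 1 FROM events e WHERE e.checksum = s.checksum);

COMMIT;
```

The NOT EXISTS probe against a billion-row-per-day table is still the costly part; restricting it to the partitions (e.g. recent EPOCH ranges) that could plausibly contain the duplicate keys would keep it cheap if your duplicates arrive close together in time.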