Propagation of duplicate rows into target table after MERGE INTO
I am running a sequence of MERGE INTO operations of the following kind:
MERGE /*+ direct */
INTO target_table T
USING source_table S
ON (T.key1 = S.key1 AND
T.key2 = S.key2)
WHEN NOT MATCHED THEN
INSERT (key1, key2, field1, field2)
VALUES (S.key1, S.key2, S.field1, S.field2)
where both source and target tables have the definition, apart from the table names, having (key1, key2) defined as a primary key.
source table is loaded sequentially with files of the size ~3GB and totally ~10 million lines/entries, and after every load (COPY DIRECT load), the data is merged into the target table using the above command.
However, following the MERGE INTO, after loading and merging 2 or more files, I am seeing a small number of duplicata rows ending up in the target table, despite enforcing the primary key. The fraction of duplicate entries is small (<1%) but still amounts to thousands of entries.
To illustrate the steps in more details:
Initially both target and source tables are empty.
- Step 1: Load file1.txt with 10 million unique entries into table S
- Step 2: MERGE INTO table S into table T. Following the merge, empty the table S
- Step 3. Load file2.txt with 10 million unique entries into table S
- Step 4: MERGE INTO table S into table T. Following the merge, empty the table S
Following the Step 4, I notice that table T has ~14 million entries (hence a large fraction of duplicates was removed as expected), but also a very small number of duplicate entries (<1%) remains.
Why is this?
Is MERGE INTO underneath implementing a complete merge and elimination of duplicates when so indicated with MATCHED clause in the command, or is it just approximate?
Does the number of entries (~dozens of millions) affect the accuracy of the MERGE INTO?
The tables used in the MERGE INTO were created with primary key across 5 columns (1 int, 4 varchar), and only default super-projection exists.
Vertica DB version in use is 7.1