Duplicates
Hi,
I'm trying to write a SQL script that should compress a table that has duplicates clients.
For example I have a table with the following format:
id, first_name, last_name, date_of_birth, number_of_sales, contry_id, date_inserted ...
and we have millions of rows with this format. The problem we have is in the following example"
0, John, Smith, 1966-01-01, 5, 53231255, 2020-01-01
1, Mary, Brown, 1956-06-01, 3, 34364363, 2019-04-02
2, John, Smith, 1966-01-01, 7, 12958345, 2021-04-05
here the client in the first row is the same with the client in the third row but they have a different country_id by which they were inserted as unique client. I want to to write a script that should summarize the table with the criteria that they people with the same first/last name and data of birth are the same people and sum up some field like number_of_sales and in some fields that have different values we should save the latest insertion . After the summarization we should have:
1, Mary, Brown, 1956-06-01, 3, 34364363, 2019-04-02
2, John, Smith, 1966-01-01, 13, 12958345, 2021-04-05
I know this is somehow abstract but any suggestion are welcome.
Thank you.
Best Answer
-
SruthiA Vertica Employee Administrator
@MrBudy01 : Please find the query below to achieve the desired result
dbadmin=> select * from clients;
id | first_name | last_name | date_of_birth | number_of_sales | country_id | date_inserted
----+------------+-----------+---------------+-----------------+------------+---------------
2 | John | Smith | 1966-01-01 | 7 | 12958345 | 2021-04-05
0 | John | Smith | 1966-01-01 | 5 | 53231255 | 2020-01-01
1 | Mary | Brown | 1956-06-01 | 3 | 34364363 | 2019-04-02
(3 rows)SELECT A.id, A.first_name, A.last_name, A.date_of_birth, B.sales, A.country_id, A.date_inserted
FROM clients A
INNER JOIN (SELECT first_name, last_name, date_of_birth, sum(number_of_sales) sales
FROM clients
GROUP BY first_name, last_name ,date_of_birth
) B
ON A.first_name = B.first_name AND A.last_name = B.last_name and A.date_of_birth = B. date_of_birth
where A.date_inserted = (select max(ae.date_inserted) from clients ae where A.first_name = ae.first_name AND A.last_name = ae.last_name and A.date_of_birth = ae.date_of_birth);id | first_name | last_name | date_of_birth | sales | country_id | date_inserted
----+------------+-----------+---------------+-------+------------+---------------
1 | Mary | Brown | 1956-06-01 | 3 | 34364363 | 2019-04-02
2 | John | Smith | 1966-01-01 | 12 | 12958345 | 2021-04-05
(2 rows)0
Answers
You could try a CTE to get a combination of TopK to get the most recent id and country with an aggregate query to get the sums:
WITH recents AS
(SELECT id, first_name, last_name, date_of_birth, contry_id, date_inserted FROM forum_duplicates
LIMIT 1 OVER(PARTITION BY first_name, last_name, date_of_birth ORDER BY date_inserted DESC)),
aggregates AS
(SELECT first_name, last_name, date_of_birth, SUM(number_of_sales) FROM forum_duplicates GROUP BY 1,2,3 ORDER BY 1,2,3)
SELECT * FROM
recents r JOIN aggregates a ON a.first_name = r.first_name AND a.last_name = r.last_name AND a.date_of_birth = r.date_of_birth;