Duplicates

MrBudy01MrBudy01 Community Edition User

Hi,

I'm trying to write a SQL script that should compress a table that has duplicates clients.
For example I have a table with the following format:

id, first_name, last_name, date_of_birth, number_of_sales, contry_id, date_inserted ...

and we have millions of rows with this format. The problem we have is in the following example"

0, John, Smith, 1966-01-01, 5, 53231255, 2020-01-01
1, Mary, Brown, 1956-06-01, 3, 34364363, 2019-04-02
2, John, Smith, 1966-01-01, 7, 12958345, 2021-04-05

here the client in the first row is the same with the client in the third row but they have a different country_id by which they were inserted as unique client. I want to to write a script that should summarize the table with the criteria that they people with the same first/last name and data of birth are the same people and sum up some field like number_of_sales and in some fields that have different values we should save the latest insertion . After the summarization we should have:

1, Mary, Brown, 1956-06-01, 3, 34364363, 2019-04-02
2, John, Smith, 1966-01-01, 13, 12958345, 2021-04-05

I know this is somehow abstract but any suggestion are welcome.
Thank you.

Best Answer

  • SruthiASruthiA Vertica Employee Administrator
    Answer ✓

    @MrBudy01 : Please find the query below to achieve the desired result

    dbadmin=> select * from clients;
    id | first_name | last_name | date_of_birth | number_of_sales | country_id | date_inserted
    ----+------------+-----------+---------------+-----------------+------------+---------------
    2 | John | Smith | 1966-01-01 | 7 | 12958345 | 2021-04-05
    0 | John | Smith | 1966-01-01 | 5 | 53231255 | 2020-01-01
    1 | Mary | Brown | 1956-06-01 | 3 | 34364363 | 2019-04-02
    (3 rows)

    SELECT A.id, A.first_name, A.last_name, A.date_of_birth, B.sales, A.country_id, A.date_inserted
    FROM clients A
    INNER JOIN (SELECT first_name, last_name, date_of_birth, sum(number_of_sales) sales
    FROM clients
    GROUP BY first_name, last_name ,date_of_birth
    ) B
    ON A.first_name = B.first_name AND A.last_name = B.last_name and A.date_of_birth = B. date_of_birth
    where A.date_inserted = (select max(ae.date_inserted) from clients ae where A.first_name = ae.first_name AND A.last_name = ae.last_name and A.date_of_birth = ae.date_of_birth);

    id | first_name | last_name | date_of_birth | sales | country_id | date_inserted
    ----+------------+-----------+---------------+-------+------------+---------------
    1 | Mary | Brown | 1956-06-01 | 3 | 34364363 | 2019-04-02
    2 | John | Smith | 1966-01-01 | 12 | 12958345 | 2021-04-05
    (2 rows)

Answers

  • Bryan_HBryan_H Vertica Employee Administrator

    You could try a CTE to get a combination of TopK to get the most recent id and country with an aggregate query to get the sums:

    WITH recents AS
    (SELECT id, first_name, last_name, date_of_birth, contry_id, date_inserted FROM forum_duplicates
    LIMIT 1 OVER(PARTITION BY first_name, last_name, date_of_birth ORDER BY date_inserted DESC)),
    aggregates AS
    (SELECT first_name, last_name, date_of_birth, SUM(number_of_sales) FROM forum_duplicates GROUP BY 1,2,3 ORDER BY 1,2,3)
    SELECT * FROM
    recents r JOIN aggregates a ON a.first_name = r.first_name AND a.last_name = r.last_name AND a.date_of_birth = r.date_of_birth;

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file