The Vertica Forum recently got a makeover! Let us know what you think by filling out this short, anonymous survey.
Please take this survey to help us learn more about how you use third party tools. Your input is greatly appreciated!

Duplicates

MrBudy01MrBudy01 Community Edition User

Hi,

I'm trying to write a SQL script that should compress a table that has duplicates clients.
For example I have a table with the following format:

id, first_name, last_name, date_of_birth, number_of_sales, contry_id, date_inserted ...

and we have millions of rows with this format. The problem we have is in the following example"

0, John, Smith, 1966-01-01, 5, 53231255, 2020-01-01
1, Mary, Brown, 1956-06-01, 3, 34364363, 2019-04-02
2, John, Smith, 1966-01-01, 7, 12958345, 2021-04-05

here the client in the first row is the same with the client in the third row but they have a different country_id by which they were inserted as unique client. I want to to write a script that should summarize the table with the criteria that they people with the same first/last name and data of birth are the same people and sum up some field like number_of_sales and in some fields that have different values we should save the latest insertion . After the summarization we should have:

1, Mary, Brown, 1956-06-01, 3, 34364363, 2019-04-02
2, John, Smith, 1966-01-01, 13, 12958345, 2021-04-05

I know this is somehow abstract but any suggestion are welcome.
Thank you.

Best Answer

  • SruthiASruthiA Vertica Employee Employee
    Answer ✓

    @MrBudy01 : Please find the query below to achieve the desired result

    dbadmin=> select * from clients;
    id | first_name | last_name | date_of_birth | number_of_sales | country_id | date_inserted
    ----+------------+-----------+---------------+-----------------+------------+---------------
    2 | John | Smith | 1966-01-01 | 7 | 12958345 | 2021-04-05
    0 | John | Smith | 1966-01-01 | 5 | 53231255 | 2020-01-01
    1 | Mary | Brown | 1956-06-01 | 3 | 34364363 | 2019-04-02
    (3 rows)

    SELECT A.id, A.first_name, A.last_name, A.date_of_birth, B.sales, A.country_id, A.date_inserted
    FROM clients A
    INNER JOIN (SELECT first_name, last_name, date_of_birth, sum(number_of_sales) sales
    FROM clients
    GROUP BY first_name, last_name ,date_of_birth
    ) B
    ON A.first_name = B.first_name AND A.last_name = B.last_name and A.date_of_birth = B. date_of_birth
    where A.date_inserted = (select max(ae.date_inserted) from clients ae where A.first_name = ae.first_name AND A.last_name = ae.last_name and A.date_of_birth = ae.date_of_birth);

    id | first_name | last_name | date_of_birth | sales | country_id | date_inserted
    ----+------------+-----------+---------------+-------+------------+---------------
    1 | Mary | Brown | 1956-06-01 | 3 | 34364363 | 2019-04-02
    2 | John | Smith | 1966-01-01 | 12 | 12958345 | 2021-04-05
    (2 rows)

Answers

  • Bryan_HBryan_H Vertica Employee Administrator

    You could try a CTE to get a combination of TopK to get the most recent id and country with an aggregate query to get the sums:

    WITH recents AS
    (SELECT id, first_name, last_name, date_of_birth, contry_id, date_inserted FROM forum_duplicates
    LIMIT 1 OVER(PARTITION BY first_name, last_name, date_of_birth ORDER BY date_inserted DESC)),
    aggregates AS
    (SELECT first_name, last_name, date_of_birth, SUM(number_of_sales) FROM forum_duplicates GROUP BY 1,2,3 ORDER BY 1,2,3)
    SELECT * FROM
    recents r JOIN aggregates a ON a.first_name = r.first_name AND a.last_name = r.last_name AND a.date_of_birth = r.date_of_birth;

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file