We're Moving!

The Vertica Forum is moving to a new OpenText Analytics Database (Vertica) Community.

Join us there to post discussion topics, learn about

product releases, share tips, access the blog, and much more.

Create My New Community Account Now


Duplicates — Vertica Forum

Duplicates

MrBudy01MrBudy01 Community Edition User

Hi,

I'm trying to write a SQL script that should compress a table that has duplicates clients.
For example I have a table with the following format:

id, first_name, last_name, date_of_birth, number_of_sales, contry_id, date_inserted ...

and we have millions of rows with this format. The problem we have is in the following example"

0, John, Smith, 1966-01-01, 5, 53231255, 2020-01-01
1, Mary, Brown, 1956-06-01, 3, 34364363, 2019-04-02
2, John, Smith, 1966-01-01, 7, 12958345, 2021-04-05

here the client in the first row is the same with the client in the third row but they have a different country_id by which they were inserted as unique client. I want to to write a script that should summarize the table with the criteria that they people with the same first/last name and data of birth are the same people and sum up some field like number_of_sales and in some fields that have different values we should save the latest insertion . After the summarization we should have:

1, Mary, Brown, 1956-06-01, 3, 34364363, 2019-04-02
2, John, Smith, 1966-01-01, 13, 12958345, 2021-04-05

I know this is somehow abstract but any suggestion are welcome.
Thank you.

Best Answer

  • SruthiASruthiA Administrator
    Answer ✓

    @MrBudy01 : Please find the query below to achieve the desired result

    dbadmin=> select * from clients;
    id | first_name | last_name | date_of_birth | number_of_sales | country_id | date_inserted
    ----+------------+-----------+---------------+-----------------+------------+---------------
    2 | John | Smith | 1966-01-01 | 7 | 12958345 | 2021-04-05
    0 | John | Smith | 1966-01-01 | 5 | 53231255 | 2020-01-01
    1 | Mary | Brown | 1956-06-01 | 3 | 34364363 | 2019-04-02
    (3 rows)

    SELECT A.id, A.first_name, A.last_name, A.date_of_birth, B.sales, A.country_id, A.date_inserted
    FROM clients A
    INNER JOIN (SELECT first_name, last_name, date_of_birth, sum(number_of_sales) sales
    FROM clients
    GROUP BY first_name, last_name ,date_of_birth
    ) B
    ON A.first_name = B.first_name AND A.last_name = B.last_name and A.date_of_birth = B. date_of_birth
    where A.date_inserted = (select max(ae.date_inserted) from clients ae where A.first_name = ae.first_name AND A.last_name = ae.last_name and A.date_of_birth = ae.date_of_birth);

    id | first_name | last_name | date_of_birth | sales | country_id | date_inserted
    ----+------------+-----------+---------------+-------+------------+---------------
    1 | Mary | Brown | 1956-06-01 | 3 | 34364363 | 2019-04-02
    2 | John | Smith | 1966-01-01 | 12 | 12958345 | 2021-04-05
    (2 rows)

Answers

  • Bryan_HBryan_H Vertica Employee Administrator

    You could try a CTE to get a combination of TopK to get the most recent id and country with an aggregate query to get the sums:

    WITH recents AS
    (SELECT id, first_name, last_name, date_of_birth, contry_id, date_inserted FROM forum_duplicates
    LIMIT 1 OVER(PARTITION BY first_name, last_name, date_of_birth ORDER BY date_inserted DESC)),
    aggregates AS
    (SELECT first_name, last_name, date_of_birth, SUM(number_of_sales) FROM forum_duplicates GROUP BY 1,2,3 ORDER BY 1,2,3)
    SELECT * FROM
    recents r JOIN aggregates a ON a.first_name = r.first_name AND a.last_name = r.last_name AND a.date_of_birth = r.date_of_birth;

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file