Analogue for COUNT(DISTINCT ) OVER (... RANGE BETWEEN)

chernenko_roma · September 2022

Hi All,

I want to share the solution if it will be useful to someone in the future, because I could not find tips for myself.

The task is the calculation of unique values for a single member for a certain period, in my case 1 year.

sure I will expect to have the following for use:

COUNT(DISTINCT Column1) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN > INTERVAL '1 years' PRECEDING AND CURRENT ROW)

but unfortunately I received an isuue (which is mentioned not ones on that site):
* SQL Error [7830] [42V13]: [Vertica]VJDBC ERROR: Only MIN/MAX and AND/OR are allowed to use DISTINCT

So, I did it in two steps:

assign each value a unique number

CASE WHEN ISNULL(Column1,'')='' THEN NULL ELSE DENSE_RANK() OVER (PARTITION BY "Member" ORDER BY Column1 DESC NULLS LAST)

sum within a row the possible number of unique values (need to calculate unique values from Column1)

CASE WHEN (SUM(DECODE(Column1RankPerMember,1,1,0)) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN INTERVAL '1 years' PRECEDING AND CURRENT ROW))>0 THEN 1 ELSE 0 END +
CASE WHEN (SUM(DECODE(Column1RankPerMember,2,1,0)) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN INTERVAL '1 years' PRECEDING AND CURRENT ROW))>0 THEN 1 ELSE 0 END +
...

The whole code and its result you may see below:

WITH cte AS (
SELECT 1 AS RowId, '11' AS "Member", '2020-01-22'::DATE AS "Date", '60' AS Column1 UNION
SELECT 2,'11','2020-3-15'::DATE,'10' UNION
SELECT 3,'11','2020-3-16'::DATE,'20' UNION
SELECT 4,'11','2020-3-17'::DATE,'30' UNION
SELECT 5,'11','2020-4-15'::DATE,'10' UNION
SELECT 6,'11','2020-4-16'::DATE,'20' UNION
SELECT 7,'11','2020-4-17'::DATE,'30' UNION
SELECT 8,'11','2020-5-15'::DATE,'10' UNION
SELECT 9,'11','2020-5-16'::DATE,'20' UNION
SELECT 10,'11','2020-5-17'::DATE,'30' UNION
SELECT 11,'11','2020-12-15'::DATE,'10' UNION
SELECT 12,'11','2020-12-16'::DATE,'20' UNION
SELECT 13,'11','2020-12-17'::DATE,'30' UNION
SELECT 14,'11','2021-1-22'::DATE,'10' UNION
SELECT 15,'11','2021-1-23'::DATE, NULL UNION
SELECT 16,'11','2021-1-24'::DATE,'30' UNION
SELECT 17,'11','2021-3-22'::DATE,'10' UNION
SELECT 18,'11','2021-3-23'::DATE,'20' UNION
SELECT 19,'11','2021-3-24'::DATE,'30' UNION
SELECT 20,'11','2021-6-24'::DATE,'40' UNION
SELECT 21,'11','2021-6-24'::DATE,'' UNION
SELECT 22,'11','2021-6-26'::DATE,'50' UNION
SELECT 23,'22','2021-1-24'::DATE,'30' UNION
SELECT 24,'22','2021-2-22'::DATE,'10' UNION
SELECT 25,'22','2021-3-23'::DATE,'20' UNION
SELECT 26,'22','2021-4-24'::DATE,'40' UNION
SELECT 27,'22','2021-6-24'::DATE,'40' UNION
SELECT 28,'22','2022-8-24'::DATE,'10'
),
cte1 AS (
SELECT *,
CASE WHEN ISNULL(Column1,'')='' THEN NULL ELSE DENSE_RANK() OVER (PARTITION BY "Member" ORDER BY Column1 DESC NULLS LAST) END AS "Column1RankPerMember"
FROM cte
)
SELECT
RowId,
"Member",
"Date",
Column1,
CASE WHEN (SUM(DECODE(Column1RankPerMember,1,1,0)) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN INTERVAL '1 years' PRECEDING AND CURRENT ROW))>0 THEN 1 ELSE 0 END +
CASE WHEN (SUM(DECODE(Column1RankPerMember,2,1,0)) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN INTERVAL '1 years' PRECEDING AND CURRENT ROW))>0 THEN 1 ELSE 0 END +
CASE WHEN (SUM(DECODE(Column1RankPerMember,3,1,0)) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN INTERVAL '1 years' PRECEDING AND CURRENT ROW))>0 THEN 1 ELSE 0 END +
CASE WHEN (SUM(DECODE(Column1RankPerMember,4,1,0)) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN INTERVAL '1 years' PRECEDING AND CURRENT ROW))>0 THEN 1 ELSE 0 END +
CASE WHEN (SUM(DECODE(Column1RankPerMember,5,1,0)) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN INTERVAL '1 years' PRECEDING AND CURRENT ROW))>0 THEN 1 ELSE 0 END +
CASE WHEN (SUM(DECODE(Column1RankPerMember,6,1,0)) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN INTERVAL '1 years' PRECEDING AND CURRENT ROW))>0 THEN 1 ELSE 0 END +
CASE WHEN (SUM(DECODE(Column1RankPerMember,7,1,0)) OVER (PARTITION BY "Member" ORDER BY "Date" RANGE BETWEEN INTERVAL '1 years' PRECEDING AND CURRENT ROW))>0 THEN 1 ELSE 0 END
AS UniquePerMemberperPeriod
FROM cte1
ORDER BY 2,3;

Result:

RowId       Member      Date        Column1     UniquePerMemberperPeriod
1       11      2020-01-22      60      1
2       11      2020-03-15      10      2
3       11      2020-03-16      20      3
4       11      2020-03-17      30      4
5       11      2020-04-15      10      4
6       11      2020-04-16      20      4
7       11      2020-04-17      30      4
8       11      2020-05-15      10      4
9       11      2020-05-16      20      4
10      11      2020-05-17      30      4
11      11      2020-12-15      10      4
12      11      2020-12-16      20      4
13      11      2020-12-17      30      4
14      11      2021-01-22      10      3
15      11      2021-01-23      [NULL]      3
16      11      2021-01-24      30      3
17      11      2021-03-22      10      3
18      11      2021-03-23      20      3
19      11      2021-03-24      30      3
21      11      2021-06-24              4
20      11      2021-06-24      40      4
22      11      2021-06-26      50      5
23      22      2021-01-24      30      1
24      22      2021-02-22      10      2
25      22      2021-03-23      20      3
26      22      2021-04-24      40      4
27      22      2021-06-24      40      4
28      22      2022-08-24      10      1

Quite a logical question about the performance: 160 million rows and targeting the existence of about 300 unique records (provided code targeting 7), takes approximately 150 sec.

VValdar · September 2022

Hi chernenko_roma,

Love the creativity of this!
How does it compare on your real case, performance wise, to the "brute" join groupby?

  select t1.RowId
       , t1."Member"
       , t1."Date"
       , t1.Column1
       , count(distinct t2.Column1)
    from tst_data as t1
    join tst_data as t2  on t2."Member" = t1."Member"
                        and t2."Date"  >= add_months(t1."Date", -12)
                        and t2."Date"  <= t1."Date"
group by 1, 2, 3, 4
order by 2,3;

chernenko_roma · October 2022

Hello @VValdar ,

Unfortunately there is no significant difference in performance, the proposed option is about minus a couple of seconds.

But, I am extremely grateful for the proposed option, since it is much easier to read in terms of the code wrote by me, the amount of code is reduced a lot.

VValdar · October 2022

Sometime the shorter code is not the more optimized.
Tbh I thought your solution would be fastest.

Let's keep both solutions somewhere in our head so when needed we have multiple opportunities

Thanks again for sharing yours with anyone here.

We're Moving!

Create My New Community Account Now

Analogue for COUNT(DISTINCT ) OVER (... RANGE BETWEEN)

Comments

Leave a Comment