Analytic function for conditional reset cumulative sum
I need to implement an analytic function that computes a cumulative sum which can be reset whenever a second column has a value greater than zero; in that case the sum restarts from the value of that column. Essentially I need something like select cumsum(case when col1 > 0 then reset to col1 else col2 + col3 end) from table group by ... Any code sample for a C++ UDA?
Comments
Can you post an example of the expected result set, along with the columns?
OK, here is an example, but my table has 2 billion rows:
ID_ITEM   DATE       FISCAL_INVENTORY   QTA_BUYED   QTA_SOLD   DAY_INVENTORY (what I need to compute)
item1     1/1/2014   100                0           0          100
item1     2/1/2014   0                  10          90         20

(changing item resets the cumulative sum; alternatively a group by clause could be used)

item2     1/1/2014   0                  10          0          10
item2     2/1/2014   0                  0           0          10
item2     3/1/2014   0                  110         10         110
item2     4/1/2014   20                 10          5          25   (inv. diff = -90)
item2     5/1/2014   0                  0           5          20
...
I am hoping for help writing a C++ UDA function in Vertica, which I think would be the fastest, online solution.
Regards.
Hm... why does the second row (item1, 2/1/2014) return 20?
By your CASE definition: col1 = 0 => return col2 + col3 = 100 => cumsum = 100 + 100 = 200.
And why does 4/1/2014 return 25 and not 20?
By your CASE definition: col1 = 20 => col1 > 0 => reset to col1 => result = 20, but it is actually col1 + col2 + col3 = 25. O_o So col1, or col1 + col2 + col3? Either way it does not match your SQL CASE definition.
Best
qta_sold always has a negative value,
qta_buyed is always positive.
Thank you!
but .... it will not work ...
To create the dataset:
CREATE TABLE TM_MYTABLE (
    id_item Varchar(100),
    "date" Date,
    fiscal_inventory Float,
    qta_buyed Float,
    qta_sold Float,
    day_inventory Float
);
insert into TM_MYTABLE values ('item1','2014-1-1', 100, 0, 0, 100);
insert into TM_MYTABLE values ('item1','2014-1-2', 0, 10, -90, 20);
insert into TM_MYTABLE values ('item2','2014-1-1', 0, 10, 0, 10);
insert into TM_MYTABLE values ('item2','2014-1-2', 0, 0, 0, 10);
insert into TM_MYTABLE values ('item2','2014-1-3', 0, 110, -10, 110);
insert into TM_MYTABLE values ('item2','2014-1-4', 20, 10, -5, 25);
insert into TM_MYTABLE values ('item2','2014-1-5', 0, 0, -5, 20);
id_item   date         fiscal_inventory   qta_buyed   qta_sold   day_inventory (expected)
item1     01/01/2014   100                0           0          100
item1     02/01/2014   0                  10          -90        20
item2     01/01/2014   0                  10          0          10
item2     02/01/2014   0                  0           0          10
item2     03/01/2014   0                  110         -10        110
item2     04/01/2014   20                 10          -5         25
item2     05/01/2014   0                  0           -5         20
The following select does not return the right numbers:
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, day_inventory AS expected,
       SUM(fiscal_inventory + qta_buyed + qta_sold)
           OVER (PARTITION BY id_item, CASE WHEN fiscal_inventory > 0 THEN 1 ELSE 0 END
                 ORDER BY "date") AS day_inventory
FROM TM_MYTABLE
ORDER BY id_item, "date"
It gives me this dataset:
id_item   date         fiscal_inventory   qta_buyed   qta_sold   expected_result   day_inventory
item1     01/01/2014   100                0           0          100               100
item1     02/01/2014   0                  10          -90        20 (expected)     -80 (wrong)

(correct would be 20: the previous day_inventory of 100, plus 10, minus 90, not -80)
This is because the rows are partitioned by 1 if fiscal_inventory > 0 and by 0 if not.

item2     01/01/2014   0                  10          0          10                10
item2     02/01/2014   0                  0           0          10                10
item2     03/01/2014   0                  110         -10        110               110
item2     04/01/2014   20                 10          -5         25                25
item2     05/01/2014   0                  0           -5         20                105 (wrong)
You need a Transform function, but a Transform function in Vertica has a limitation: you can't get any other columns, only the CumSum output. Is that good enough for you?

 CUMSUM
--------
    100
     20
     10
     10
    110
     20
     15
(7 rows)
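For reference, output like this would come from invoking a user-defined transform function over each partition, roughly along these lines (CumSum and its argument list are a hypothetical example here, not a Vertica built-in):

-- Hypothetical invocation of a user-defined transform function (UDTF);
-- note that no other columns can be selected alongside it.
SELECT CumSum(fiscal_inventory, qta_buyed, qta_sold)
       OVER (PARTITION BY id_item ORDER BY "date")
FROM TM_MYTABLE;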
PS
Still doesn't match (the last 2 rows). Can you explain why 25 (item2, 4/1/2014) and not 20?
The reason you can't emit other columns with a UDT is that a UDT doesn't have a 1:1 correspondence between input rows and output rows. You could read in a million rows and emit just one row per partition, like a per-partition aggregate function. Or the reverse -- for example, a tokenizer UDT like the one on our github site, that takes in a collection of text documents and emits one row per word in any document. In either of these cases, for fields not in the UDT, how do we map values from input rows back onto output rows?
For simple cases, take a look at UDAnalytic functions. They are like UDT's but enforce a 1:1 correspondence between input and output rows. So you can select other columns in addition to the output of the function.
For more complex cases, a simple workaround is to pass the extra columns into your UDT and have it decide what values to emit for what rows.
Also -- UDT's can be used in subqueries. So you can fetch a set of columns with a UDT in the inner query, then use the outer query to join against additional tables, apply filters, rearrange columns, etc.
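As a rough sketch of that subquery workaround (CumSum is again a hypothetical UDT, and its output column names are assumed, not an existing function):

-- Hypothetical: the UDT emits id_item, "date" and cum_sum for every input row;
-- the outer query joins back to the base table to recover the remaining columns.
SELECT t.id_item, t."date", m.fiscal_inventory, m.qta_buyed, m.qta_sold, t.cum_sum
FROM (
    SELECT CumSum(id_item, "date", fiscal_inventory, qta_buyed, qta_sold)
           OVER (PARTITION BY id_item ORDER BY "date")
    FROM TM_MYTABLE
) t
JOIN TM_MYTABLE m ON m.id_item = t.id_item AND m."date" = t."date"
ORDER BY t.id_item, t."date";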
Adam
Identifying patterns like this actually sounds like a potential use case for Vertica's MATCH SQL extension, intended for event-series analysis:
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/SELECT/MATCH...
MATCH clauses are like regular expressions that scan down through your rows of data. You can easily write expressions to identify this sort of pattern. (Let me know if you're having a hard time with the syntax; I or others here can probably post examples.)
If the sum is reset every time col1 is nonzero, it sounds like what you want are patterns of the form "(col1 is nonzero) (col1 is zero)*". So, write that pattern; then *within each partition*, because you know that you start with col1 being nonzero and it's always zero after that until the end of the partition, just do something like "sum(case when col1 > 0 then col1 else col2 end)".
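A rough, untested sketch of how that pattern maps onto the MATCH syntax ("reset" is starred here so that partitions which begin before any inventory row still match; the OP's full working query appears further down the thread):

-- Each match of P gets its own pattern_id(), so every reset starts a new group
-- for the cumulative sum computed in an outer query.
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold,
       pattern_id() AS p_id
FROM TM_MYTABLE
MATCH (PARTITION BY id_item ORDER BY "date"
       DEFINE
           reset    AS fiscal_inventory > 0,
           carry_on AS fiscal_inventory = 0
       PATTERN
           P AS (reset* carry_on*));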
Adam
For your question:
<<Still doesn't match (2 last rows). Can you explain - why 25 (item2 4/1/2014) and not 20?>>
the answer is:
On 4/1/2014 item2 has a physical inventory of 20 pieces, so my count must restart from 20; then I bought 10 and sold 5, so 20 + 10 - 5 = 25 at the end of the day. The difference of 90 is the inventory difference between the fiscal and the physical inventory.
when col1>0 then reset to col1 else col2+col3 end
col1 = 20 > 0 => reset to col1 => 20.
Good luck; maybe Adam can help you.
WOW !
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, expected,
       SUM(fiscal_inventory + qta_buyed + qta_sold)
           OVER (PARTITION BY id_item, p_id ORDER BY "date") AS day_inventory
FROM (
    SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, day_inventory AS expected,
           pattern_id() AS p_id
    FROM TM_MYTABLE
    MATCH (PARTITION BY id_item ORDER BY "date"
           DEFINE
               inventory AS fiscal_inventory > 0,
               no_inventory AS fiscal_inventory = 0
           PATTERN
               P AS (inventory* no_inventory*))
    ORDER BY id_item, "date"
) dati
GROUP BY id_item, "date", fiscal_inventory, qta_buyed, qta_sold, expected, p_id
id_item   date         fiscal_inventory   qta_buyed   qta_sold   expected   DAY_INVENTORY
item1     01/01/2014   100                0           0          100        100
item1     02/01/2014   0                  10          -90        20         20
item2     01/01/2014   0                  10          0          10         10
item2     02/01/2014   0                  0           0          10         10
item2     03/01/2014   0                  110         -10        110        110
item2     04/01/2014   20                 10          -5         25         25
item2     05/01/2014   0                  0           -5         20         20
Now the expected values and the computed DAY_INVENTORY match!
I just don't know whether the resources will be sufficient for my 1-billion-row table.
Thank you all for the help
Massimo.
After a lot of brainstorming on this scenario, my colleagues and I came up with this:
First we created a temp table, then used a query on top of it (a sketch follows below). However, I wish I had known about the MATCH clause from the beginning.
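(NC's actual statements weren't included in the post; the following is only a hypothetical illustration of a temp-table approach to the same reset-group idea, not NC's code.)

-- Hypothetical reconstruction, not NC's actual queries.
-- Step 1: a running count of inventory rows gives each reset its own group id.
CREATE LOCAL TEMPORARY TABLE tmp_groups ON COMMIT PRESERVE ROWS AS
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, day_inventory,
       SUM(CASE WHEN fiscal_inventory > 0 THEN 1 ELSE 0 END)
           OVER (PARTITION BY id_item ORDER BY "date") AS grp
FROM TM_MYTABLE;

-- Step 2: cumulative sum restarted inside each (id_item, grp) group.
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, day_inventory,
       SUM(fiscal_inventory + qta_buyed + qta_sold)
           OVER (PARTITION BY id_item, grp ORDER BY "date") AS computed_day_inventory
FROM tmp_groups
ORDER BY id_item, "date";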
Truly, Vertica is an analytic platform.
Hope this helps.
NC
Good Luck
Be careful: MATCH-PATTERN has bugs; it fails to find all paths.
Example: user USR_1 did A -> C -> B -> A.
OK, now let's find ACB: found. Now let's find the full path ACBA: not found. OK, let's try one more time with a small change: it fails to find the full path again.
Hm... interesting... Does anything unique to Vertica work properly? MATCH-PATTERN, TIMESERIES? Does Vertica do unit testing? A simple test fails.
PS
I tried:
- RESULTS ALL ROWS
- rows match first event
- rows match all events
Nothing; all of them fail. Good luck!
Regarding the original draft of your post (which was e-mailed out, though I see you have revised the online version): The query there illustrates an important misunderstanding about MATCH; I'd like to clarify, as I think it will help others and is likely to be a common point of confusion.
Looking at your original example: (Note that I have rewritten the match-clause names to be single characters; that's for illustrative purposes.) You might expect that this query should match this table exactly. Well, it doesn't, and it's not supposed to.
The question that is interesting is, what happens if you have two match clauses that both match a particular row? In this case, every time that A matches, D also matches.
This case is kind of a base case; really an instance of "so don't do that!" You can trivially rewrite the pattern as "(A B C A)", replace all instances of "D" with "A", and you will find that the pattern does match the contents of the table.
Why? Because MATCH computes a match string, not a match graph.
Match clauses can be anything. For example, what if you had "D as event LIKE 'e%'". This will, of course, match any row that also matches A, B, or C. These are extremely simple examples; both the match clauses and the patterns can get much more complex. Once you allow for the possibility of duplicate patterns matching a particular row, possibly in the middle of a long complex pattern, etc., the complexity of the resulting analysis increases substantially.
So, we make a performance trade-off: MATCH is specified to only consider the first match-clause that matches any given row. So, "D as event LIKE 'e%'" would only match the 'e%' rows that weren't already matched by a preceding clause, i.e., anything other than 'e1', 'e2', or 'e3'.
(Note that, if your expressions are all independent and if one is improbable and expensive to evaluate, this also lets you avoid computing the improbable one most of the time by moving it to the end of the list. This optimization is not possible if we must consider every match clause as a possible match for every row. Again, all about performance in exchange for flexibility.)
Think of it like converting your table into a string; one character per row in the table. The first row above matches the first match-clause, so we append 'A' to the string. The second row matches the second match-clause, so we append 'B'. The third matches the third; so, 'C'. The fourth ... well, starting from the top, the first matching clause is 'A', so we append 'A'. Having scanned the whole table, we end up with "ABCA". Does that match the pattern "(ABCD)"? No. So, no match.
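To make this concrete, here is a small, untested reconstruction of the scenario (the events table, its columns, and the event values are assumptions; only the clause and pattern structure comes from the discussion above):

-- Hypothetical events table: one user whose rows arrive as e1, e2, e3, e1.
-- A, B and C each match one specific event; D also matches every row they match.
SELECT usr, ts, event, event_name(), pattern_id()
FROM events
MATCH (PARTITION BY usr ORDER BY ts
       DEFINE
           A AS event = 'e1',
           B AS event = 'e2',
           C AS event = 'e3',
           D AS event LIKE 'e%'
       PATTERN
           P AS (A B C D)
       ROWS MATCH FIRST EVENT);
-- Row by row, the match string is built from the FIRST clause that matches:
-- e1 -> A, e2 -> B, e3 -> C, e1 -> A again (never D), giving "ABCA".
-- "ABCA" does not match (A B C D), so no rows are returned;
-- rewriting the pattern as (A B C A) does match.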
Both variants of this syntax have their uses. Right now, we only support the one described above. If you would like the other, feel free to post to the "Ideas" section of the site.
My bad, I was confused because of: http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/SELECT/MATCH...
Actually, it isn't true; PATTERN behaves like a SWITCH-CASE (C/C++) with break, and of course only the first matching CASE wins.
The calculation works perfectly: for a billion-record table we get the expected response in just 3 seconds for every shop.
Great!
The final code is below:
SELECT ID_Pdv, ID_Articolo, ID_Data,
       VAL_QuantitàDiffInventario, VAL_QuantitàAcquisto, VAL_QuantitàVendita,
       VAL_QuantitàDistruzioni, VAL_QuantitàVenditaPreInv,
       LAG(Giacenza_Finale, 1, 0)
           OVER (PARTITION BY ID_Pdv, ID_Articolo ORDER BY ID_Data ASC) AS Giacenza_Iniziale,
       Giacenza_Finale
INTO MEGAMONITOR.dt_pdv_fatti_aggr_day_gia"+context.casual+"
FROM (
    SELECT ID_Pdv, ID_Articolo, ID_Data,
           VAL_QuantitàDiffInventario, VAL_QuantitàAcquisto, VAL_QuantitàVendita,
           VAL_QuantitàDistruzioni, VAL_QuantitàVenditaPreInv,
           SUM(dati.VAL_QuantitàDiffInventario + VAL_QuantitàAcquisto
               - VAL_QuantitàVendita - VAL_QuantitàDistruzioni)
               OVER (PARTITION BY ID_Pdv, ID_Articolo, p_id ORDER BY ID_Data) AS Giacenza_Finale
    FROM (
        SELECT f.ID_Pdv, ID_Articolo, f.ID_Data,
               f.VAL_QuantitàDiffInventario, f.VAL_QuantitàAcquisto,
               CASE WHEN f.VAL_QuantitàDiffInventario > 0 AND f.VAL_QuantitàVendita > 0
                    THEN 0 ELSE f.VAL_QuantitàVendita END AS VAL_QuantitàVendita,
               CASE WHEN f.VAL_QuantitàDiffInventario > 0 AND f.VAL_QuantitàVendita > 0
                    THEN f.VAL_QuantitàVendita ELSE 0 END AS VAL_QuantitàVenditaPreInv,
               f.VAL_QuantitàDistruzioni, pattern_id() AS p_id
        FROM MEGAMONITOR.DT_PDV_FATTI_AGGR f
             INNER JOIN MEGAMONITOR.AN_PDV a ON f.ID_Pdv = a.ID_Pdv
        WHERE f.ID_Data BETWEEN '"+context.data+"' AND '"+context.datafi+"'
          AND f.ID_Pdv LIKE '"+context.cliente+"'
          AND a.FL_Diretto LIKE '"+context.diretto+"'
        MATCH (PARTITION BY f.ID_Pdv, ID_Articolo ORDER BY f.ID_Data
               DEFINE
                   inventory AS f.VAL_QuantitàDiffInventario > 0,
                   no_inventory AS f.VAL_QuantitàDiffInventario = 0
               PATTERN
                   P AS (inventory* no_inventory*))
    ) dati
) giacenza
ORDER BY ID_Pdv, ID_Articolo, ID_Data