Analytic function for conditional reset cumulative sum

I need to implement an analytic function that computes a cumulative sum which can be reset when a second column has a value greater than zero; in that case the sum restarts from the value of that column. I need something like select cumsum(case when col1>0 then reset to col1 else col2+col3 end) from table group by ... Any code sample for a C++ UDA?

Comments

  • Navin_C (Vertica Customer)
    Hi Massimo,

    Can you post an example of the expected result set along with the columns?
  • Hi Navin...

    OK, this is an example, but my table has 2 billion rows:

    ID_ITEM   DATE       FISCAL_INVENTORY   QTA_BUYED   QTA_SOLD   DAY_INVENTORY (what I need to compute)
    item1     1/1/2014         100                  0          0             100
    item1     2/1/2014           0                 10         90              20
    (changing item resets the cumulative sum, or use a group by clause)
    item2     1/1/2014           0                 10          0              10
    item2     2/1/2014           0                  0          0              10
    item2     3/1/2014           0                110         10             110
    item2     4/1/2014          20                 10          5              25    (inv. diff = -90)
    item2     5/1/2014           0                  0          5              20
    ...

    I hope to get help writing a C++ UDA function in Vertica, which I think would be the fastest, online solution.

    Regards.
  • I was not able to follow the example of how day_inventory is computed. Analytic sums provide cumulative sums by default, but you may need to tweak the partition clause and add a CASE expression for the secondary reset column: (case when col2 > 0 then 1 else 0 end) as change_cond, then partition by item1, change_cond (sketched below).
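    Spelled out against the table Massimo defines further down the thread, the suggestion would read roughly like this (a sketch; as Massimo's later reply shows, this partitioning alone does not produce the desired reset):

    -- Sketch of the change_cond partitioning idea (see the replies below).
    SELECT id_item, "date",
           SUM(fiscal_inventory + qta_buyed + qta_sold)
               OVER (PARTITION BY id_item,
                                  CASE WHEN fiscal_inventory > 0 THEN 1 ELSE 0 END
                     ORDER BY "date") AS day_inventory
    FROM TM_MYTABLE;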
  • Hi!

    Hm... why does the second row (item1, 2/1/2014) return 20?
    By your CASE definition: col1 = 0 => return col2 + col3 = 100 => cumsum = 100 + 100 = 200.

    Why does 4/1/2014 return 25 and not 20?
    By your CASE definition: col1 = 20 => col1 > 0 => reset to col1 => result = 20, but it is actually col1 + col2 + col3 = 25. O_o
    "if a second column has a value greater than zero, in this case the sum restarts from the value of the second column"
    So col1 or col2? That does not match your SQL CASE definition.

    Best
  • Sorry,
    qta_sold always has a negative value,
    qta_buyed is always positive.

    Thank you!
  • Hi Colin, thanks for your suggestion...
    but it will not work...

    To create the dataset:

    CREATE TABLE TM_MYTABLE (
        id_item Varchar(100),
        "date" Date,
        fiscal_inventory Float,  -- numeric, so it can be summed below
        qta_buyed Float,
        qta_sold Float,
        day_inventory Float );

    insert into TM_MYTABLE values('item1','2014-1-1',100,0,0,100);
    insert into TM_MYTABLE values('item1','2014-1-2',0,10,-90,20);
    insert into TM_MYTABLE values('item2','2014-1-1',0,10,0,10);
    insert into TM_MYTABLE values('item2','2014-1-2',0,0,0,10);
    insert into TM_MYTABLE values('item2','2014-1-3',0,110,-10,110);
    insert into TM_MYTABLE values('item2','2014-1-4',20,10,-5,25);
    insert into TM_MYTABLE values('item2','2014-1-5',0,0,-5,20);

    id_item  date        fiscal_inventory  qta_buyed  qta_sold  day_inventory (expected)
    item1    01/01/2014               100          0         0            100
    item1    02/01/2014                 0         10       -90             20
    item2    01/01/2014                 0         10         0             10
    item2    02/01/2014                 0          0         0             10
    item2    03/01/2014                 0        110       -10            110
    item2    04/01/2014                20         10        -5             25
    item2    05/01/2014                 0          0        -5             20

    The following select does not return the right numbers:

    SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold,
           day_inventory AS expected,
           SUM(fiscal_inventory + qta_buyed + qta_sold)
               OVER (PARTITION BY id_item,
                                  CASE WHEN fiscal_inventory > 0 THEN 1 ELSE 0 END
                     ORDER BY "date") AS day_inventory
    FROM TM_MYTABLE
    ORDER BY id_item, "date"


    It gives me this dataset:

    id_item  date        fiscal_inventory  qta_buyed  qta_sold  expected_result  day_inventory

    item1    01/01/2014               100          0         0              100       100
    item1    02/01/2014                 0         10       -90               20       -80 (wrong)
    (correct: 100 as the previous day_inventory + 10 - 90 = 20, not -80.
    This is because the row is partitioned by 1 when fiscal_inventory > 0 and by 0 when it is not.)

    item2    01/01/2014                 0         10         0               10        10
    item2    02/01/2014                 0          0         0               10        10
    item2    03/01/2014                 0        110       -10              110       110
    item2    04/01/2014                20         10        -5               25        25
    item2    05/01/2014                 0          0        -5               20       105 (wrong)


  • Hi!

    You need a transform function, but transform functions in Vertica have a limitation:
    UDTFs can only be used in a SELECT list that contains just the UDTF call and a required OVER clause.
    I.e., you can't get any other columns besides the CumSum. Is that good enough for you? =>

    select ConditionalCumSum(col1, col2, col3) over (partition by item order by date) from inventory;
     CUMSUM
    --------
        100
         20
         10
         10
        110
         20
         15
    (7 rows)

    PS
    The last 2 rows still don't match. Can you explain why 25 (item2, 4/1/2014) and not 20?
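    For reference, loading such a compiled UDTF into Vertica would look roughly like this; the .so path and the factory name here are hypothetical, not from an actual build:

    -- Hypothetical path and factory name; adjust to your build.
    CREATE LIBRARY CumSumLib AS '/home/dbadmin/ConditionalCumSum.so';
    CREATE TRANSFORM FUNCTION ConditionalCumSum
        AS LANGUAGE 'C++' NAME 'ConditionalCumSumFactory' LIBRARY CumSumLib;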
  • For what it's worth, in recent Vertica versions, I believe that columns in the UDT's PARTITION BY expression can also be fetched in the SELECT clause. 

    The reason you can't emit other columns with a UDT is that a UDT doesn't have a 1:1 correspondence between input rows and output rows.  You could read in a million rows and emit just one row per partition, like a per-partition aggregate function.  Or the reverse -- for example, a tokenizer UDT like the one on our github site, that takes in a collection of text documents and emits one row per word in any document.  In either of these cases, for fields not in the UDT, how do we map values from input rows back onto output rows?

    For simple cases, take a look at UDAnalytic functions.  They are like UDTs but enforce a 1:1 correspondence between input and output rows.  So you can select other columns in addition to the output of the function.

    For more complex cases, a simple workaround is to pass the extra columns into your UDT and have it decide what values to emit for what rows.

    Also -- UDTs can be used in subqueries.  So you can fetch a set of columns with a UDT in the inner query, then use the outer query to join against additional tables, apply filters, rearrange columns, etc.
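    For illustration, registering and calling a UDAnalytic for this thread's problem might look like the following; the library path, factory name, and function name are all hypothetical, and the C++ body itself would still need to be written against the SDK:

    -- Hypothetical names throughout; only the DDL shape is standard.
    CREATE LIBRARY AnalyticLib AS '/home/dbadmin/ConditionalCumSum.so';
    CREATE ANALYTIC FUNCTION ConditionalCumSum
        AS LANGUAGE 'C++' NAME 'ConditionalCumSumFactory' LIBRARY AnalyticLib;

    -- 1:1 input/output correspondence, so other columns can sit beside it:
    SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold,
           ConditionalCumSum(fiscal_inventory, qta_buyed, qta_sold)
               OVER (PARTITION BY id_item ORDER BY "date") AS day_inventory
    FROM TM_MYTABLE;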

    Adam
  • Hi Massimo,

    Identifying patterns like this actually sounds like a potential use case for Vertica's MATCH SQL extension, intended for event-series analysis:

    http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/SELECT/MATCH...

    MATCH clauses are like regular expressions that scan down through your rows of data.  You can easily write expressions to identify this sort of pattern.  (Let me know if you're having a hard time with the syntax; I or others here can probably post examples.)

    If the sum is reset every time col1 is nonzero, it sounds like what you want are patterns of the form "(col1 is nonzero) (col1 is zero)*".  So, write that pattern; then *within each partition*, because you know that you start with col1 being nonzero and it's always zero after that until the end of the partition, just do something like "sum(case when col1 > 0 then col1 else col2 end)".

    Adam
  • Hi Daniel,
    to your question:
    <<The last 2 rows still don't match. Can you explain why 25 (item2, 4/1/2014) and not 20?>>
    the answer is:
    on 4/1/2014 item2 has a physical inventory of 20 pieces, so my count must restart from 20; then I bought 10 and sold 5, so 20 + 10 - 5 = 25 at the end of the day. The difference of 90 is the inventory difference between the fiscal and the physical inventory.


  • I'm tired, or I can't understand, or you haven't defined your problem precisely.

    when col1>0 then reset to col1 else col2+col3 end

    col1 = 20 > 0 => reset to col1 => 20.

    Good luck; maybe Adam can help you.
  • Great, Adam!!!
    WOW!

    select id_item, "date", fiscal_inventory, qta_buyed, qta_sold, expected,
           sum(fiscal_inventory + qta_buyed + qta_sold)
               over (partition by id_item, p_id order by "date") as day_inventory
    from (
        SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold,
               day_inventory expected,
               pattern_id() as p_id
        FROM TM_MYTABLE
        MATCH (partition by id_item order by "date"
               DEFINE
                   inventory as fiscal_inventory > 0,
                   no_inventory as fiscal_inventory = 0
               PATTERN
                   P AS (inventory* no_inventory*)
        )
        order by id_item, "date"
    ) dati
    group by id_item, "date", fiscal_inventory, qta_buyed, qta_sold, expected, p_id



    id_item  date        fiscal_inventory  qta_buyed  qta_sold  expected  DAY_INVENTORY
    item1    01/01/2014               100          0         0       100            100
    item1    02/01/2014                 0         10       -90        20             20
    item2    01/01/2014                 0         10         0        10             10
    item2    02/01/2014                 0          0         0        10             10
    item2    03/01/2014                 0        110       -10       110            110
    item2    04/01/2014                20         10        -5        25             25
    item2    05/01/2014                 0          0        -5        20             20

    Now the expected computation and DAY_INVENTORY agree!

    Now I don't know whether the resources will be sufficient for my "1 billion record table".

    Thank you all for the help.

    Massimo.


  • No no, you are right... my problem definition is not perfect... it resets to (column1 + column2 + column3), because after the inventory is taken the item can still be sold or bought that same day. Sorry again, Massimo. (A per-row restatement in SQL is sketched below.)
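    A minimal sketch of the corrected per-row semantics against the sample table; the cumulative reset across rows is what the MATCH / pattern_id() query above supplies:

    SELECT id_item, "date",
           CASE WHEN fiscal_inventory > 0
                THEN fiscal_inventory + qta_buyed + qta_sold  -- inventory day: restart from the count
                ELSE qta_buyed + qta_sold                     -- normal day: apply the delta
           END AS day_delta
    FROM TM_MYTABLE
    ORDER BY id_item, "date";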
  • Navin_C (Vertica Customer)
    Hi Massimo,

    After a lot of brainstorming on this scenario, my colleagues and I came up with this.

    First, create a temp table:

    CREATE TABLE NAVIN.test_tmp AS
    SELECT A.ID_ITEM, isnull(A.RNK, 100) rnk, B.ITEM_DATE
    FROM (
        SELECT B.*, RNK
        FROM (SELECT * FROM (SELECT *, CASE WHEN FISCAL_INVENTORY > 0
                     THEN RANK() OVER (PARTITION BY ID_ITEM ORDER BY ITEM_DATE) END RNK
               FROM NAVIN.ITEM) A
              WHERE RNK IS NOT NULL) A
        RIGHT JOIN (SELECT * FROM NAVIN.ITEM) B
            ON A.ID_ITEM = B.ID_ITEM AND A.ITEM_DATE <= B.ITEM_DATE
    ) A
    LEFT JOIN (
        SELECT B.*, RNK
        FROM (SELECT * FROM (SELECT *, CASE WHEN FISCAL_INVENTORY > 0
                     THEN RANK() OVER (PARTITION BY ID_ITEM ORDER BY ITEM_DATE) END RNK
               FROM NAVIN.ITEM) A
              WHERE RNK IS NOT NULL) A
        RIGHT JOIN (SELECT * FROM NAVIN.ITEM) B
            ON A.ID_ITEM = B.ID_ITEM AND A.ITEM_DATE <= B.ITEM_DATE
    ) B ON A.ID_ITEM = B.ID_ITEM
       AND isnull(A.RNK, 100) = isnull(B.RNK, 100)
       AND A.ITEM_DATE <= B.ITEM_DATE
    GROUP BY A.ID_ITEM, A.RNK, B.ITEM_DATE
    Then use this query:

    SELECT a.ID_ITEM, a.rnk, b.item_date, sum(a.FISCAL_INVENTORY), sum(a.QTA_BUYED),
           sum(a.qta_sold), sum(a.FISCAL_INVENTORY) + sum(a.QTA_BUYED) - sum(a.qta_sold)
    FROM (SELECT a.*, b.rnk FROM navin.item a
          JOIN navin.test_tmp b ON a.id_item = b.id_item AND a.item_date = b.item_date) a
    LEFT JOIN navin.test_tmp b
        ON a.rnk = b.rnk AND a.ID_ITEM = b.ID_ITEM AND a.item_date <= b.item_date
    GROUP BY a.ID_ITEM, a.rnk, b.item_date
    ORDER BY 1, 3, 2
     ID_ITEM | rnk | item_date  | sum | sum | sum | ?column?
    ---------+-----+------------+-----+-----+-----+----------
     item1   |   1 | 2014-01-01 | 100 |   0 |   0 |      100
     item1   |   1 | 2014-01-02 | 100 |  10 |  90 |       20
     item2   | 100 | 2014-01-01 |   0 |  10 |   0 |       10
     item2   | 100 | 2014-01-02 |   0 |  10 |   0 |       10
     item2   | 100 | 2014-01-03 |   0 | 120 |  10 |      110
     item2   |   4 | 2014-01-04 |  20 |  10 |   5 |       25
     item2   |   4 | 2014-01-05 |  20 |  10 |  10 |       20
    However, I wish I had known about the MATCH clause from the beginning.
    Vertica truly is an analytic platform :)

    Hope this helps.
    NC
  • I found that MATCH-PATTERN isn't deterministic: it can return different results for the same data set.
    Good luck.

    VDB=> select count(*) from (
            SELECT
                internal_user_id AS uid
            FROM
                sch.tbl
            WHERE
                session_event_time_ts > '2014-06-23'
                and
                session_event_time_ts < '2014-06-25'
            MATCH (
                PARTITION BY internal_user_id
                ORDER BY session_event_time_ts
                DEFINE
                    E1 AS session_event_code='S',
                    E2 AS session_event_code='A',
                    E3 AS session_event_code='R'
                PATTERN P3 AS (E1 E2 E3) RESULTS ALL ROWS)) x;
     count
    -------
      2898
    (1 row)

    VDB=> \g
     count
    -------
      2892
    (1 row)

    VDB=> \g
     count
    -------
      2910
    (1 row)
  • Hi!

    Be careful: MATCH-PATTERN has bugs; it fails to find all paths.

    Example:
    daniel=> select * from Path;
     user  | event_time | event
    -------+------------+-------
     USR_1 | 2000-01-01 | A
     USR_1 | 2000-01-02 | C
     USR_1 | 2000-01-03 | B
     USR_1 | 2000-01-04 | A
    (4 rows)

    So user USR_1 did A -> C -> B -> A.


    OK, now let's find ACB:
    select
        "user",
        event,
        event_name(),
        pattern_id(),
        match_id()
    from
        public.Path
    match (
        partition by "user" order by event_time
        define
            e1 as event = 'A',
            e2 as event = 'C',
            e3 as event = 'B'
        pattern
            p as (e1 e2 e3)
        rows match first event
    );
    daniel=> \e
     user  | event | event_name | pattern_id | match_id
    -------+-------+------------+------------+----------
     USR_1 | A     | e1         |          1 |        1
     USR_1 | C     | e2         |          1 |        2
     USR_1 | B     | e3         |          1 |        3
    (3 rows)

    Found it. Now let's find the full path ACBA:
    daniel=> \p
    select
        "user",
        event,
        event_name(),
        pattern_id(),
        match_id()
    from
        public.Path
    match (
        partition by "user" order by event_time
        define
            e1 as event = 'A',
            e2 as event = 'C',
            e3 as event = 'B',
            e4 as event = 'A'
        pattern
            p as (e1 e2 e3 e4)
        rows match first event
    );
    daniel=> \e
     user | event | event_name | pattern_id | match_id
    ------+-------+------------+------------+----------
    (0 rows)
    Not found; OK, let's try one more time with a small change:
    daniel=> select
    daniel-> "user",
    daniel-> event,
    daniel-> event_name(),
    daniel-> pattern_id(),
    daniel-> match_id()
    daniel-> from
    daniel-> public.Path
    daniel-> match (
    daniel(> partition by "user" order by event_time
    daniel(> define
    daniel(> e1 as event = 'A',
    daniel(> e2 as event = 'C',
    daniel(> e3 as event = 'B',
    daniel(> e4 as event = 'A'
    daniel(> pattern
    daniel(> p as (e1 e2 e3 e4*)
    daniel(> rows match first event
    daniel(> );
     user  | event | event_name | pattern_id | match_id
    -------+-------+------------+------------+----------
     USR_1 | A     | e1         |          1 |        1
     USR_1 | C     | e2         |          1 |        2
     USR_1 | B     | e3         |          1 |        3
    (3 rows)
    One more time it failed to find the full path.

    Hm... interesting... Does anything unique to Vertica work properly? MATCH-PATTERN, TIMESERIES? Does Vertica do unit testing? A simple test fails.

    PS

    I tried:
    • RESULTS ALL ROWS
    • rows match first event
    • rows match all events
    Nothing; all of them fail.

    Good luck!


  • Hi Daniel -- as discussed when you previously posted these results, it would be very helpful if you could send a standalone reproducer that works in a clean environment where you know that the table itself is not changing.  We have not seen this behavior in our in-house tests.

    Regarding the original draft of your post (which was e-mailed out, though I see you have revised the online version):  The query there illustrates an important misunderstanding about MATCH; I'd like to clarify, as I think it will help others and is likely to be a common point of confusion.

    Looking at your original example:
    select * from Path;
     user  | event_time | event
    -------+------------+-------
     USR_1 | 2000-01-01 | e1
     USR_1 | 2000-01-02 | e3
     USR_1 | 2000-01-03 | e2
     USR_1 | 2000-01-04 | e1
    (4 rows)

    select
        "user",
        event,
        event_name(),
        pattern_id(),
        match_id()
    from
        public.Path
    match (
        partition by "user" order by event_time
        define
            A as event = 'e1',
            B as event = 'e3',
            C as event = 'e2',
            D as event = 'e1'
        pattern
            p as (A B C D)
        rows match first event
    );
    daniel=> \e
     user | event | event_name | pattern_id | match_id
    ------+-------+------------+------------+----------
    (0 rows)
    (Note that I have rewritten the match-clause names to be single characters; that's for illustrative purposes.)  You might expect that this query should match this table exactly.  Well, it doesn't, and it's not supposed to.

    The question that is interesting is, what happens if you have two match clauses that both match a particular row?  In this case, every time that A matches, D also matches.

    This case is kind of a base case; really an instance of "so don't do that!"  You can trivially rewrite the pattern as "(A B C A)", replace all instances of "D" with "A", and you will find that the pattern does match the contents of the table.
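    Concretely, that rewrite of the query above looks like this (a sketch; D is folded into A and the pattern becomes (A B C A)):

    select "user", event, event_name(), pattern_id(), match_id()
    from public.Path
    match (
        partition by "user" order by event_time
        define
            A as event = 'e1',
            B as event = 'e3',
            C as event = 'e2'
        pattern
            p as (A B C A)
        rows match first event
    );
    -- The generated match string is "ABCA", which the pattern now matches.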

    Why?  Because MATCH computes a match string, not a match graph.

    Match clauses can be anything.  For example, what if you had "D as event LIKE 'e%'".  This will, of course, match any row that also matches A, B, or C.  These are extremely simple examples; both the match clauses and the patterns can get much more complex.  Once you allow for the possibility of duplicate patterns matching a particular row, possibly in the middle of a long complex pattern, etc., the complexity of the resulting analysis increases substantially.

    So, we make a performance trade-off:  MATCH is specified to only consider the first match-clause that matches any given row.  So, "D as event LIKE 'e%'" would only match all 'e%'s that weren't matched by preceding clauses, i.e., 'e1', 'e3', or 'e2'.

    (Note that, if your expressions are all independent and if one is improbable and expensive to evaluate, this also lets you avoid computing the improbable one most of the time by moving it to the end of the list.  This optimization is not possible if we must consider every match clause as a possible match for every row.  Again, all about performance in exchange for flexibility.)


    Think of it like converting your table into a string; one character per row in the table.  The first row above matches the first match-clause, so we append 'A' to the string.  The second row matches the second match-clause, so we append 'B'.  The third matches the third; so, 'C'.  The fourth ... well, starting from the top, the first matching clause is 'A', so we append 'A'.  Having scanned the whole table, we end up with "ABCA".  Does that match the pattern "(ABCD)"?  No.  So, no match.


    Both variants of this syntax have their uses.  Right now, we only support the one described above.  If you would like the other, feel free to post to the "Ideas" section of the site.
  • Hi!

    My bad; I was confused because:
    "PATTERN defines the regular expression composed of event types that you specify in the DEFINE subclause."
    http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/SELECT/MATCH...
    Actually, that isn't quite true: PATTERN behaves like a C/C++ switch-case with break, and of course only the first matching CASE fires.
  • Last update:
    the calculation works perfectly, and for a billion-record table
    we get the expected response in just 3 seconds for every shop.
    Great!

    The final code is below:

    SELECT  ID_Pdv, ID_Articolo, ID_Data , 
    VAL_QuantitàDiffInventario,VAL_QuantitàAcquisto , VAL_QuantitàVendita , VAL_QuantitàDistruzioni,  VAL_QuantitàVenditaPreInv,
    LAG(Giacenza_Finale,1,0) OVER(partition by ID_Pdv, ID_Articolo order by ID_Data asc) as Giacenza_Iniziale , 
    Giacenza_Finale

    INTO MEGAMONITOR.dt_pdv_fatti_aggr_day_gia"+context.casual+"

    FROM  
    (SELECT ID_Pdv, ID_Articolo, ID_Data, VAL_QuantitàDiffInventario,VAL_QuantitàAcquisto , VAL_QuantitàVendita , VAL_QuantitàDistruzioni,  VAL_QuantitàVenditaPreInv,
    SUM(dati.VAL_QuantitàDiffInventario + VAL_QuantitàAcquisto - VAL_QuantitàVendita - VAL_QuantitàDistruzioni) 
        OVER(PARTITION BY ID_Pdv, ID_Articolo, p_id ORDER BY ID_Data ) Giacenza_Finale

    FROM  
    (SELECT f.ID_Pdv, ID_Articolo, f.ID_Data, 
     f.VAL_QuantitàDiffInventario,f.VAL_QuantitàAcquisto , 
    case when  f.VAL_QuantitàDiffInventario >0 and f.VAL_QuantitàVendita >0 then 0 else f.VAL_QuantitàVendita end as VAL_QuantitàVendita,
    case when  f.VAL_QuantitàDiffInventario >0 and f.VAL_QuantitàVendita >0 then f.VAL_QuantitàVendita  else 0 end as VAL_QuantitàVenditaPreInv,
    f.VAL_QuantitàDistruzioni, pattern_id() as p_id
     FROM MEGAMONITOR.DT_PDV_FATTI_AGGR f INNER JOIN MEGAMONITOR.AN_PDV a ON f.ID_Pdv=a.ID_Pdv
     WHERE  f.ID_Data between '"+context.data+"' and '"+context.datafi+"' and f.ID_Pdv like '"+context.cliente+"' and a.FL_Diretto like '"+context.diretto+"'
     MATCH (partition by f.ID_Pdv, ID_Articolo order by f.ID_Data
     DEFINE
     inventory as f.VAL_QuantitàDiffInventario > 0,
     no_inventory as f.VAL_QuantitàDiffInventario = 0
     PATTERN
     P AS (inventory* no_inventory*))
    ) dati
    ) giacenza
    order by ID_Pdv,  ID_Articolo , ID_Data

