Analytic function for conditional reset cumulative sum
I need to implement an analytic function that computes a cumulative sum which can be reset whenever a second column has a value greater than zero; in that case the sum restarts from the value of that column. Essentially I need something like select cumsum(case when col1 > 0 then reset to col1 else col2 + col3 end) from table group by ... Any code sample for a C++ UDA?
Comments
Can you post an example of the expected result set, along with the columns?
OK, here is an example, but my table has 2 billion rows:
ID_ITEM   DATE       FISCAL_INVENTORY   QTA_BUYED   QTA_SOLD   DAY_INVENTORY (what I need to compute)
item1     1/1/2014   100                0           0          100
item1     2/1/2014   0                  10          90         20

(changing item resets the cumulative sum; alternatively a group by clause could be used)

item2     1/1/2014   0                  10          0          10
item2     2/1/2014   0                  0           0          10
item2     3/1/2014   0                  110         10         110
item2     4/1/2014   20                 10          5          25   (inv. diff = -90)
item2     5/1/2014   0                  0           5          20
...
I am hoping for help writing a C++ UDA function in Vertica, which I think would be the fastest, online solution.
Regards.
Hm... why does the second row (item1, 2/1/2014) return 20?
By your CASE definition: col1 = 0 => return col2 + col3 = 100 => cumsum = 100 + 100 = 200.
And why does 4/1/2014 return 25 and not 20?
By your CASE definition: col1 = 20 => col1 > 0 => reset to col1 => result = 20, but it is actually col1 + col2 + col3 = 25. O_o So col1, or col1 + col2 + col3? Either way it does not match your SQL CASE definition.
Best
qta_sold always has a negative value,
qta_buyed is always positive.
Thank you!
but .... it will not work ...
To create the dataset:
CREATE TABLE TM_MYTABLE (
    id_item Varchar(100),
    "date" Date,
    fiscal_inventory Float,
    qta_buyed Float,
    qta_sold Float,
    day_inventory Float
);
insert into TM_MYTABLE values ('item1','2014-1-1', 100, 0, 0, 100);
insert into TM_MYTABLE values ('item1','2014-1-2', 0, 10, -90, 20);
insert into TM_MYTABLE values ('item2','2014-1-1', 0, 10, 0, 10);
insert into TM_MYTABLE values ('item2','2014-1-2', 0, 0, 0, 10);
insert into TM_MYTABLE values ('item2','2014-1-3', 0, 110, -10, 110);
insert into TM_MYTABLE values ('item2','2014-1-4', 20, 10, -5, 25);
insert into TM_MYTABLE values ('item2','2014-1-5', 0, 0, -5, 20);
id_item   date         fiscal_inventory   qta_buyed   qta_sold   day_inventory (expected)
item1     01/01/2014   100                0           0          100
item1     02/01/2014   0                  10          -90        20
item2     01/01/2014   0                  10          0          10
item2     02/01/2014   0                  0           0          10
item2     03/01/2014   0                  110         -10        110
item2     04/01/2014   20                 10          -5         25
item2     05/01/2014   0                  0           -5         20
The following select does not return the right numbers:
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, day_inventory AS expected,
       SUM(fiscal_inventory + qta_buyed + qta_sold)
           OVER (PARTITION BY id_item, CASE WHEN fiscal_inventory > 0 THEN 1 ELSE 0 END
                 ORDER BY "date") AS day_inventory
FROM TM_MYTABLE
ORDER BY id_item, "date"
It gives me this dataset:
id_item   date         fiscal_inventory   qta_buyed   qta_sold   expected_result   day_inventory
item1     01/01/2014   100                0           0          100               100
item1     02/01/2014   0                  10          -90        20 (expected)     -80 (wrong)

(correct would be 20: the previous day_inventory of 100, plus 10, minus 90, not -80)
This is because the rows are partitioned by 1 if fiscal_inventory > 0 and by 0 if not.

item2     01/01/2014   0                  10          0          10                10
item2     02/01/2014   0                  0           0          10                10
item2     03/01/2014   0                  110         -10        110               110
item2     04/01/2014   20                 10          -5         25                25
item2     05/01/2014   0                  0           -5         20                105 (wrong)
You need a Transform function, but a Transform function in Vertica has a limitation: you can't get any other columns, only the CumSum output. Is that good enough for you?

 CUMSUM
--------
    100
     20
     10
     10
    110
     20
     15
(7 rows)
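For reference, output like this would come from invoking a user-defined transform function over each partition, roughly along these lines (CumSum and its argument list are a hypothetical example here, not a Vertica built-in):

-- Hypothetical invocation of a user-defined transform function (UDTF);
-- note that no other columns can be selected alongside it.
SELECT CumSum(fiscal_inventory, qta_buyed, qta_sold)
       OVER (PARTITION BY id_item ORDER BY "date")
FROM TM_MYTABLE;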
PS
Still doesn't match (the last 2 rows). Can you explain why 25 (item2, 4/1/2014) and not 20?
The reason you can't emit other columns with a UDT is that a UDT doesn't have a 1:1 correspondence between input rows and output rows. You could read in a million rows and emit just one row per partition, like a per-partition aggregate function. Or the reverse -- for example, a tokenizer UDT like the one on our github site, that takes in a collection of text documents and emits one row per word in any document. In either of these cases, for fields not in the UDT, how do we map values from input rows back onto output rows?
For simple cases, take a look at UDAnalytic functions. They are like UDT's but enforce a 1:1 correspondence between input and output rows. So you can select other columns in addition to the output of the function.
For more complex cases, a simple workaround is to pass the extra columns into your UDT and have it decide what values to emit for what rows.
Also -- UDT's can be used in subqueries. So you can fetch a set of columns with a UDT in the inner query, then use the outer query to join against additional tables, apply filters, rearrange columns, etc.
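As a rough sketch of that subquery workaround (CumSum is again a hypothetical UDT, and its output column names are assumed, not an existing function):

-- Hypothetical: the UDT emits id_item, "date" and cum_sum for every input row;
-- the outer query joins back to the base table to recover the remaining columns.
SELECT t.id_item, t."date", m.fiscal_inventory, m.qta_buyed, m.qta_sold, t.cum_sum
FROM (
    SELECT CumSum(id_item, "date", fiscal_inventory, qta_buyed, qta_sold)
           OVER (PARTITION BY id_item ORDER BY "date")
    FROM TM_MYTABLE
) t
JOIN TM_MYTABLE m ON m.id_item = t.id_item AND m."date" = t."date"
ORDER BY t.id_item, t."date";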
Adam
Identifying patterns like this actually sounds like a potential use case for Vertica's MATCH SQL extension, intended for event-series analysis:
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/SELECT/MATCH...
MATCH clauses are like regular expressions that scan down through your rows of data. You can easily write expressions to identify this sort of pattern. (Let me know if you're having a hard time with the syntax; I or others here can probably post examples.)
If the sum is reset every time col1 is nonzero, it sounds like what you want are patterns of the form "(col1 is nonzero) (col1 is zero)*". So, write that pattern; then *within each partition*, because you know that you start with col1 being nonzero and it's always zero after that until the end of the partition, just do something like "sum(case when col1 > 0 then col1 else col2 end)".
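A rough, untested sketch of how that pattern maps onto the MATCH syntax ("reset" is starred here so that partitions which begin before any inventory row still match; the OP's full working query appears further down the thread):

-- Each match of P gets its own pattern_id(), so every reset starts a new group
-- for the cumulative sum computed in an outer query.
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold,
       pattern_id() AS p_id
FROM TM_MYTABLE
MATCH (PARTITION BY id_item ORDER BY "date"
       DEFINE
           reset    AS fiscal_inventory > 0,
           carry_on AS fiscal_inventory = 0
       PATTERN
           P AS (reset* carry_on*));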
Adam
For your question:
<<Still doesn't match (2 last rows). Can you explain - why 25 (item2 4/1/2014) and not 20?>>
the answer is:
On 4/1/2014 item2 has a physical inventory of 20 pieces, so my count must restart from 20; then I bought 10 and sold 5, so 20 + 10 - 5 = 25 at the end of the day. The difference of 90 is the inventory difference between the fiscal and the physical inventory.
when col1>0 then reset to col1 else col2+col3 end
col1 = 20 > 0 => reset to col1 => 20.
Good luck; maybe Adam can help you.
WOW !
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, expected,
       SUM(fiscal_inventory + qta_buyed + qta_sold)
           OVER (PARTITION BY id_item, p_id ORDER BY "date") AS day_inventory
FROM (
    SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, day_inventory AS expected,
           pattern_id() AS p_id
    FROM TM_MYTABLE
    MATCH (PARTITION BY id_item ORDER BY "date"
           DEFINE
               inventory AS fiscal_inventory > 0,
               no_inventory AS fiscal_inventory = 0
           PATTERN
               P AS (inventory* no_inventory*))
    ORDER BY id_item, "date"
) dati
GROUP BY id_item, "date", fiscal_inventory, qta_buyed, qta_sold, expected, p_id
id_item   date         fiscal_inventory   qta_buyed   qta_sold   expected   DAY_INVENTORY
item1     01/01/2014   100                0           0          100        100
item1     02/01/2014   0                  10          -90        20         20
item2     01/01/2014   0                  10          0          10         10
item2     02/01/2014   0                  0           0          10         10
item2     03/01/2014   0                  110         -10        110        110
item2     04/01/2014   20                 10          -5         25         25
item2     05/01/2014   0                  0           -5         20         20
Now the expected values and the computed DAY_INVENTORY match!
I just don't know whether the resources will be sufficient for my 1-billion-row table.
Thank you all for the help
Massimo.
After a lot of brainstorming on this scenario, my colleagues and I came up with this:
First we created a temp table, then used a query on top of it (a sketch follows below). However, I wish I had known about the MATCH clause from the beginning.
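(NC's actual statements weren't included in the post; the following is only a hypothetical illustration of a temp-table approach to the same reset-group idea, not NC's code.)

-- Hypothetical reconstruction, not NC's actual queries.
-- Step 1: a running count of inventory rows gives each reset its own group id.
CREATE LOCAL TEMPORARY TABLE tmp_groups ON COMMIT PRESERVE ROWS AS
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, day_inventory,
       SUM(CASE WHEN fiscal_inventory > 0 THEN 1 ELSE 0 END)
           OVER (PARTITION BY id_item ORDER BY "date") AS grp
FROM TM_MYTABLE;

-- Step 2: cumulative sum restarted inside each (id_item, grp) group.
SELECT id_item, "date", fiscal_inventory, qta_buyed, qta_sold, day_inventory,
       SUM(fiscal_inventory + qta_buyed + qta_sold)
           OVER (PARTITION BY id_item, grp ORDER BY "date") AS computed_day_inventory
FROM tmp_groups
ORDER BY id_item, "date";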
Truly, Vertica is an analytic platform.
Hope this helps.
NC
Good Luck
Be careful: MATCH-PATTERN has bugs; it fails to find all paths.
Example: user USR_1 did A -> C -> B -> A.
OK, now let's find ACB: found. Now let's find the full path ACBA: not found. OK, let's try one more time with a small change: it fails to find the full path again.
Hm... interesting... Does anything unique to Vertica work properly? MATCH-PATTERN, TIMESERIES? Does Vertica do unit testing? A simple test fails.
PS
I tried:
- RESULTS ALL ROWS
- rows match first event
- rows match all events
Nothing; all of them fail. Good luck!
Regarding the original draft of your post (which was e-mailed out, though I see you have revised the online version): The query there illustrates an important misunderstanding about MATCH; I'd like to clarify, as I think it will help others and is likely to be a common point of confusion.
Looking at your original example: (Note that I have rewritten the match-clause names to be single characters; that's for illustrative purposes.) You might expect that this query should match this table exactly. Well, it doesn't, and it's not supposed to.
The question that is interesting is, what happens if you have two match clauses that both match a particular row? In this case, every time that A matches, D also matches.
This case is kind of a base case; really an instance of "so don't do that!" You can trivially rewrite the pattern as "(A B C A)", replace all instances of "D" with "A", and you will find that the pattern does match the contents of the table.
Why? Because MATCH computes a match string, not a match graph.
Match clauses can be anything. For example, what if you had "D as event LIKE 'e%'". This will, of course, match any row that also matches A, B, or C. These are extremely simple examples; both the match clauses and the patterns can get much more complex. Once you allow for the possibility of duplicate patterns matching a particular row, possibly in the middle of a long complex pattern, etc., the complexity of the resulting analysis increases substantially.
So, we make a performance trade-off: MATCH is specified to only consider the first match-clause that matches any given row. So, "D as event LIKE 'e%'" would only match the 'e%' rows that weren't already matched by a preceding clause, i.e., anything other than 'e1', 'e2', or 'e3'.
(Note that, if your expressions are all independent and if one is improbable and expensive to evaluate, this also lets you avoid computing the improbable one most of the time by moving it to the end of the list. This optimization is not possible if we must consider every match clause as a possible match for every row. Again, all about performance in exchange for flexibility.)
Think of it like converting your table into a string; one character per row in the table. The first row above matches the first match-clause, so we append 'A' to the string. The second row matches the second match-clause, so we append 'B'. The third matches the third; so, 'C'. The fourth ... well, starting from the top, the first matching clause is 'A', so we append 'A'. Having scanned the whole table, we end up with "ABCA". Does that match the pattern "(ABCD)"? No. So, no match.
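To make this concrete, here is a small, untested reconstruction of the scenario (the events table, its columns, and the event values are assumptions; only the clause and pattern structure comes from the discussion above):

-- Hypothetical events table: one user whose rows arrive as e1, e2, e3, e1.
-- A, B and C each match one specific event; D also matches every row they match.
SELECT usr, ts, event, event_name(), pattern_id()
FROM events
MATCH (PARTITION BY usr ORDER BY ts
       DEFINE
           A AS event = 'e1',
           B AS event = 'e2',
           C AS event = 'e3',
           D AS event LIKE 'e%'
       PATTERN
           P AS (A B C D)
       ROWS MATCH FIRST EVENT);
-- Row by row, the match string is built from the FIRST clause that matches:
-- e1 -> A, e2 -> B, e3 -> C, e1 -> A again (never D), giving "ABCA".
-- "ABCA" does not match (A B C D), so no rows are returned;
-- rewriting the pattern as (A B C A) does match.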
Both variants of this syntax have their uses. Right now, we only support the one described above. If you would like the other, feel free to post to the "Ideas" section of the site.
My bad, I was confused because of: http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Statements/SELECT/MATCH...
Actually, it isn't true; PATTERN behaves like a SWITCH-CASE (C/C++) with break, and of course only the first matching CASE wins.
The calculation works perfectly: for a billion-record table we get the expected response in just 3 seconds for every shop.
Great!
The final code is below:
SELECT ID_Pdv, ID_Articolo, ID_Data,
       VAL_QuantitàDiffInventario, VAL_QuantitàAcquisto, VAL_QuantitàVendita,
       VAL_QuantitàDistruzioni, VAL_QuantitàVenditaPreInv,
       LAG(Giacenza_Finale, 1, 0)
           OVER (PARTITION BY ID_Pdv, ID_Articolo ORDER BY ID_Data ASC) AS Giacenza_Iniziale,
       Giacenza_Finale
INTO MEGAMONITOR.dt_pdv_fatti_aggr_day_gia"+context.casual+"
FROM (
    SELECT ID_Pdv, ID_Articolo, ID_Data,
           VAL_QuantitàDiffInventario, VAL_QuantitàAcquisto, VAL_QuantitàVendita,
           VAL_QuantitàDistruzioni, VAL_QuantitàVenditaPreInv,
           SUM(dati.VAL_QuantitàDiffInventario + VAL_QuantitàAcquisto
               - VAL_QuantitàVendita - VAL_QuantitàDistruzioni)
               OVER (PARTITION BY ID_Pdv, ID_Articolo, p_id ORDER BY ID_Data) AS Giacenza_Finale
    FROM (
        SELECT f.ID_Pdv, ID_Articolo, f.ID_Data,
               f.VAL_QuantitàDiffInventario, f.VAL_QuantitàAcquisto,
               CASE WHEN f.VAL_QuantitàDiffInventario > 0 AND f.VAL_QuantitàVendita > 0
                    THEN 0 ELSE f.VAL_QuantitàVendita END AS VAL_QuantitàVendita,
               CASE WHEN f.VAL_QuantitàDiffInventario > 0 AND f.VAL_QuantitàVendita > 0
                    THEN f.VAL_QuantitàVendita ELSE 0 END AS VAL_QuantitàVenditaPreInv,
               f.VAL_QuantitàDistruzioni, pattern_id() AS p_id
        FROM MEGAMONITOR.DT_PDV_FATTI_AGGR f
             INNER JOIN MEGAMONITOR.AN_PDV a ON f.ID_Pdv = a.ID_Pdv
        WHERE f.ID_Data BETWEEN '"+context.data+"' AND '"+context.datafi+"'
          AND f.ID_Pdv LIKE '"+context.cliente+"'
          AND a.FL_Diretto LIKE '"+context.diretto+"'
        MATCH (PARTITION BY f.ID_Pdv, ID_Articolo ORDER BY f.ID_Data
               DEFINE
                   inventory AS f.VAL_QuantitàDiffInventario > 0,
                   no_inventory AS f.VAL_QuantitàDiffInventario = 0
               PATTERN
                   P AS (inventory* no_inventory*))
    ) dati
) giacenza
ORDER BY ID_Pdv, ID_Articolo, ID_Data