can an analytic query filter window clause based on fields in the current row?
# i have a vertica 7.0.1 table which captures stats on network traffic
# each record is the sum of bytes in all packets between 2 addresses over a time interval (say 10 seconds) which ends at time_t
# time_t is seconds since the epoch, 01-01-1970
CREATE TABLE network_stats(src_ip INT, dst_ip INT, time_t INT, bytes INT);
# ... watch the network traffic and fill the table for a while
# now find the sum of bytes over a period 5 minutes before through 5 minutes after each record for which the dst_ip is the same as the current record
SELECT a.time_t, a.src_ip, a.dst_ip,
(SELECT SUM(bytes) as tot FROM network_stats b WHERE time_t BETWEEN (a.time_t - 300) AND (a.time_t + 300) AND (b.dst_ip = a.dst_ip)) AS dst
FROM network_stats a;
ERROR 4160: Non-equality correlated subquery expression is not supported
OK, can this be done with an analytic query? The question is how to get criteria from the current row (the dst_ip) into the window range.
Thanks for help, marcus
# each record is the sum of bytes in all packets between 2 addresses over a time interval (say 10 seconds) which ends at time_t
# time_t is seconds since the epoch, 01-01-1970
CREATE TABLE network_stats(src_ip INT, dst_ip INT, time_t INT, bytes INT);
# ... watch the network traffic and fill the table for a while
# now find the sum of bytes over a period 5 minutes before through 5 minutes after each record for which the dst_ip is the same as the current record
SELECT a.time_t, a.src_ip, a.dst_ip,
(SELECT SUM(bytes) as tot FROM network_stats b WHERE time_t BETWEEN (a.time_t - 300) AND (a.time_t + 300) AND (b.dst_ip = a.dst_ip)) AS dst
FROM network_stats a;
ERROR 4160: Non-equality correlated subquery expression is not supported
OK, can this be done with an analytic query? The question is how to get criteria from the current row (the dst_ip) into the window range.
Thanks for help, marcus
0
Comments
It can be done in couple ways. With minimal changes you need to use in ANALYTICAL SUM:
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/Analytic/SUMA...
where PARTITION BY is ORDER BY And you may need to play with window_frame_clause
http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/Analytic/wind...
thanks for the reply, and apologies if i wasn't clear. getting the partition by is not the issue I've run into, it's how to add an additional condition from the current row.
so in pseudo code I'm asking
where PARTITION BY is A correlated subquery like this doesn't work because the condition needs to be applied in the analytic clause, not the outer WHERE, which is evaluated first:
SELECT a.time_t, a.src_ip, a.dst_ip,
(SELECT SUM(bytes)
OVER (PARTITION BY b.time_t BETWEEN (b.time_t - 300) AND (b.time_t + 300) ORDER BY b.time_t) AS SUM
FROM network_stats b
WHERE b.src_ip = a.src_ip)
FROM pair_stats_flat_queue a;
I am able to do the operation in 2 queries like this:
SELECT time_t, src_ip, SUM(bytes) AS bytes
INTO src_aggr
FROM network_stats
WHERE src_ip IS NOT NULL AND dst_ip IS NOT NULL
GROUP BY time_t, src_ip
ORDER BY time_t DESC, src_ip;
SELECT src_ip, time_t,
SUM(bytes) OVER (PARTITION BY src_ip ORDER BY time_t RANGE BETWEEN 300 PRECEDING AND 300 FOLLOWING) AS src_window_sum
FROM src_aggr
ORDER BY time_t DESC;
but is there a way to reduce this to a single query?
Try one of the below options :
select a.time_t, a.src_ip, a.dst_ip,SUM(bytes) over (partition BY time_slice(time_t,300, 'second','START') +
SUM(bytes) over (partition BY time_slice(time_t,300, 'second','END') AS dst
FROM network_stats a;
or just
select a.time_t, a.src_ip, a.dst_ip,
SUM(bytes) over (partition BY time_slice(time_t,600, 'second') AS dst
FROM network_stats a;
Thanks for the reply. If the second query you suggest is modified to:
SELECT a.time_t, a.src_ip, a.dst_ip,
SUM(bytes) over ( partition BY time_slice(TIMESTAMP WITH TIME ZONE 'EPOCH' + (time_t-300) * INTERVAL '1 second',600, 'SECOND') ) AS dst
FROM network_stats a ;
it will get the 5 minutes before and after the current record, but it still doesn't address the additional aggregation by ip over the moving window as the 2 query solution I mentioned above does.
Can you clarify it more?:
You are trying to calculate TRAFFIC THROUGHPUT for current second, last 5 min and for last 15 mins, like CPU USAGE (load average) in Linux? (last 1min, 5 min or 15 mins - it doesn't matter, its same algorithm).