Options

can an analytic query filter window clause based on fields in the current row?

# i have a vertica 7.0.1 table which captures stats on network traffic
# each record is the sum of bytes in all packets between 2 addresses over a time interval (say 10 seconds) which ends at time_t
# time_t is seconds since the epoch, 01-01-1970

CREATE TABLE network_stats(src_ip INT, dst_ip INT, time_t INT, bytes INT);

# ... watch the network traffic and  fill the table for a while

# now find the sum of bytes over a period 5 minutes before through 5 minutes after each record for which the dst_ip is the same as the current record

SELECT a.time_t, a.src_ip, a.dst_ip,
    (SELECT SUM(bytes) as tot FROM network_stats b WHERE time_t BETWEEN (a.time_t - 300) AND (a.time_t + 300) AND (b.dst_ip = a.dst_ip)) AS dst
FROM network_stats a;

ERROR 4160:  Non-equality correlated subquery expression is not supported

OK, can this be done with an analytic query? The question is how to get criteria from the current row (the dst_ip) into the window range.

Thanks for help,   marcus


Comments

  • Options
    Hi!

    It can be done in couple ways. With minimal changes you need to use in ANALYTICAL SUM:
    http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/Analytic/SUMA...


    where PARTITION BY is
    time_t BETWEEN (time_t - 300) AND (time_t + 300)
    ORDER BY
    time_t
    And you may need to play with window_frame_clause
    http://my.vertica.com/docs/7.1.x/HTML/index.htm#Authoring/SQLReferenceManual/Functions/Analytic/wind...
  • Options
    Hi Daniel,
    thanks for the reply, and apologies if i wasn't clear. getting the partition by is not the issue I've run into, it's how to add an additional condition from the current row. 
    so in pseudo code I'm asking

    where PARTITION BY is
    time_t BETWEEN (time_t - 300) AND (time_t + 300) {AND ONLY INCLUDE dst_ip of the current row }
    A correlated subquery like this doesn't work because the condition needs to be applied in the analytic clause, not the outer WHERE, which is evaluated first:

    SELECT a.time_t, a.src_ip, a.dst_ip,
        (SELECT SUM(bytes) 
        OVER (PARTITION BY b.time_t BETWEEN (b.time_t - 300) AND (b.time_t + 300) ORDER BY b.time_t) AS SUM
        FROM network_stats b
        WHERE b.src_ip = a.src_ip)
    FROM pair_stats_flat_queue a;


    I am able to do the operation in 2 queries like this:

    SELECT time_t, src_ip, SUM(bytes) AS bytes
    INTO src_aggr
    FROM network_stats
    WHERE src_ip IS NOT NULL AND dst_ip IS NOT NULL
    GROUP BY time_t, src_ip
    ORDER BY time_t DESC, src_ip;

    SELECT src_ip, time_t,
    SUM(bytes) OVER (PARTITION BY src_ip ORDER BY time_t RANGE BETWEEN 300 PRECEDING AND 300 FOLLOWING) AS src_window_sum
    FROM src_aggr
    ORDER BY time_t DESC;

    but is there a way to reduce this to a single query?

  • Options
    Hi 
    Try one of the below options :

    select a.time_t, a.src_ip, a.dst_ip,SUM(bytes) over (partition BY time_slice(time_t,300, 'second','START') +
    SUM(bytes) over (partition BY time_slice(time_t,300, 'second','END')  AS dst
    FROM network_stats a;

    or just 

    select a.time_t, a.src_ip, a.dst_ip,
    SUM(bytes) over (partition BY time_slice(time_t,600, 'second')  AS dst
    FROM network_stats a;


  • Options
    Hi Eli,

    Thanks for the reply. If the second query you suggest is modified to:
    SELECT a.time_t, a.src_ip, a.dst_ip,
    SUM(bytes) over ( partition BY time_slice(TIMESTAMP WITH TIME ZONE 'EPOCH' + (time_t-300) * INTERVAL '1 second',600, 'SECOND') ) AS dst
    FROM network_stats a ;

    it will get the 5 minutes before and after the current record, but it still doesn't address the additional aggregation by ip over the moving window as the 2 query solution I mentioned above does.


  • Options
    I am sure i understend ! Do you try to add to the patition by cluse the ip in addtion to the time_slice function ?
  • Options
    Hi!

    Can you clarify it more?:

    You are trying to calculate TRAFFIC THROUGHPUT for current second, last 5 min and for last 15 mins, like CPU USAGE (load average) in Linux?  (last 1min, 5 min or 15 mins - it doesn't matter, its same algorithm).
    daniel@xps:~$ uptime 
    12:39:55 up 24 min, 2 users, load average: 0.09, 0.12, 0.13

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file