SQL help with finding peaks across like values in time series data
Hi, I'm trying to figure out whether I should use more traditional techniques like subqueries and groupings, or more advanced ones like time_slice and windowed partitioning, for what seems like a fairly straightforward use case: analyzing time-series data coming from our system.
Like a lot of people, we have bytes measured (ingress and egress) coming from lots of clusters, and I'm trying to find the total bytes per time interval (normally MB per day) and the peaks for each of these clusters.
I make monthly views out of the raw flex data I'm working with, which have the following structure:
```sql
CREATE VIEW egress_data_2018_12 AS
SELECT distinct(id), name, value, ts, cluster_id, metric_name, server_name, user_id, tenant
FROM transferstats_2018
WHERE ts >= '2018-12-01 00:00:00'
  AND ts < '2019-01-01 00:00:00'
  AND metric_name = 'server_tenant_metrics_response_bytes';
```
(ingress is the same except a different metric_name and a different view name)
It's not very pretty, but I have a working SELECT statement that leverages subqueries to get every day's total usage across a given cluster within the month:
```sql
SELECT x.cluster_id,
       x.tenant,
       round(sum(value) / 1024 / 1024, 2) AS total_mb_egressed,
       ( SELECT round(sum(value) / 1024 / 1024, 2)
         FROM egress_data_2018_12 y
         WHERE y.tenant = x.tenant
           AND ts >= '2018-12-01 00:00:00' AND ts < '2018-12-02 00:00:00' ) mb_12_01,
       ( SELECT round(sum(value) / 1024 / 1024, 2)
         FROM egress_data_2018_12 y
         WHERE y.tenant = x.tenant
           AND ts >= '2018-12-02 00:00:00' AND ts < '2018-12-03 00:00:00' ) mb_12_02
       -- etc... repeat for 29 more days
FROM egress_data_2018_12 x
WHERE x.ts >= '2018-12-01 00:00:00' AND x.ts < '2018-12-03 00:00:00'
GROUP BY x.cluster_id, x.tenant, mb_12_01, mb_12_02
ORDER BY mb_12_01 DESC;
```
Note that the reason this works is that all of the values for each server_name and user_id get summed together for a given tenant/cluster_id tuple, which is exactly what we're looking for.
Now, while this works, I'm sure it could be done more elegantly than either writing out or even generating all that SQL. But the real problem is that I can't figure out how to get the other statistic I want, which is a peak for each day - basically finding the minute with the most bytes transferred for each cluster/tenant tuple (again, summed over all servers and users), and exposing that value at the hour, day, and/or month level.
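For what it's worth, the shape I imagine the daily totals could collapse to is something like this (just an untested sketch on my end, and it gives one row per cluster/tenant/day rather than one wide row per month, so it may not be exactly what I want):

```sql
-- Untested sketch: daily totals as one row per cluster/tenant/day
-- instead of the hand-written per-day subqueries above.
SELECT cluster_id,
       tenant,
       date_trunc('day', ts) AS day,
       round(sum(value) / 1024 / 1024, 2) AS total_mb_egressed
FROM egress_data_2018_12
GROUP BY cluster_id, tenant, date_trunc('day', ts)
ORDER BY day, total_mb_egressed DESC;
```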
From what I've read it sounds like time slices might be able to help me, but it also sounds like this might be a good fit for traditional windowing. Before heading down the wrong path, I figured I'd ask here in case anyone's done something like this before.
Comments
I think time series queries can help, if only to make your query a little less unwieldy. Having said that, I'm going to post an example of a time series query that's going to make yours look simple in comparison - but don't let the length fool you; it's actually a rather simple query.
```sql
--this unwieldy query just produces a report of all the CPU utilization numbers across nodes.
--it can be useful if one node has CPU spikes for some weird reason; this will help isolate that possibility
--it assumes node 1 ends with %0001, and looks at up to 6 nodes. With some brute-force copy/paste you can expand this up to whatever
--it can also run on less than 6 nodes, since everything beyond the first is an outer join.
--you can change the interval if you like; the default is 1 minute. Last line of the SQL
--you can also change the date range. Currently just looks at current date
--just dump the results in your favorite graphing tool to visualize it. I use Excel
select _time
,to_char((n1_user+n1_nice+n1_system+n1_io_wait+n1_irq+n1_soft+n1_steal+n1_guest) / 1000000, '999.999') as n1_cpu
,to_char((n2_user+n2_nice+n2_system+n2_io_wait+n2_irq+n2_soft+n2_steal+n2_guest) / 1000000, '999.999') as n2_cpu
,to_char((n3_user+n3_nice+n3_system+n3_io_wait+n3_irq+n3_soft+n3_steal+n3_guest) / 1000000, '999.999') as n3_cpu
,to_char((n4_user+n4_nice+n4_system+n4_io_wait+n4_irq+n4_soft+n4_steal+n4_guest) / 1000000, '999.999') as n4_cpu
,to_char((n5_user+n5_nice+n5_system+n5_io_wait+n5_irq+n5_soft+n5_steal+n5_guest) / 1000000, '999.999') as n5_cpu
,to_char((n6_user+n6_nice+n6_system+n6_io_wait+n6_irq+n6_soft+n6_steal+n6_guest) / 1000000, '999.999') as n6_cpu
from ( select slice_time as _time
, ts_first_value(n1.user_microseconds_peak_delta, 'LINEAR') as n1_user
, ts_first_value(n2.user_microseconds_peak_delta IGNORE NULLS, 'LINEAR') as n2_user
, ts_first_value(n3.user_microseconds_peak_delta IGNORE NULLS, 'LINEAR') as n3_user
, ts_first_value(n4.user_microseconds_peak_delta IGNORE NULLS, 'LINEAR') as n4_user
, ts_first_value(n5.user_microseconds_peak_delta IGNORE NULLS, 'LINEAR') as n5_user
, ts_first_value(n6.user_microseconds_peak_delta IGNORE NULLS, 'LINEAR') as n6_user
```
You could change the last value to "1 day" to do a 1 day slice. And you might not need to use LINEAR interpolation, since you might not have missing values.
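The "last value" I mean is in the TIMESERIES clause on the last line. Against your egress view it would look roughly like this (a minimal, untested sketch - adjust the partition columns to whatever you actually group by):

```sql
-- Untested sketch of a TIMESERIES query against the OP's view;
-- the '1 minute' literal is the slice width you'd change to '1 day'.
SELECT slice_time,
       cluster_id,
       tenant,
       ts_first_value(value, 'LINEAR') AS v
FROM egress_data_2018_12
TIMESERIES slice_time AS '1 minute' OVER (PARTITION BY cluster_id, tenant ORDER BY ts);
```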
To get the peak minute, you might need to use an analytic function, since that value doesn't actually exist in your table - it exists in your result. You can use the traditional max, sum, and count functions, but to use them as analytics you supply an OVER() clause. Have you used analytics before?
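Roughly, I'd expect something along these lines to get the peak minute per day (a sketch only, untested against your data): sum per minute first, then rank the minutes within each day with an analytic function and keep the top one.

```sql
-- Untested sketch: peak minute per cluster/tenant/day.
-- Step 1: collapse raw rows to per-minute totals.
-- Step 2: rank the minutes within each day with an analytic (OVER) function.
WITH per_minute AS (
    SELECT cluster_id,
           tenant,
           date_trunc('minute', ts) AS minute_ts,
           sum(value) AS bytes_in_minute
    FROM egress_data_2018_12
    GROUP BY cluster_id, tenant, date_trunc('minute', ts)
)
SELECT cluster_id,
       tenant,
       date_trunc('day', minute_ts) AS day,
       minute_ts AS peak_minute,
       round(bytes_in_minute / 1024 / 1024, 2) AS peak_mb
FROM (
    SELECT per_minute.*,
           row_number() OVER (PARTITION BY cluster_id, tenant, date_trunc('day', minute_ts)
                              ORDER BY bytes_in_minute DESC) AS rn
    FROM per_minute
) ranked
WHERE rn = 1
ORDER BY cluster_id, tenant, day;
```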
Thanks @Vertica_Curtis for taking the time to reply. I think your query formatting might've gotten messed up. Are you saying something like your query could be the input for my peaks query?
I haven't used any Vertica analytics features explicitly, to my knowledge, no.