SQL help with finding peaks across like values in time series data
Hi, I'm trying to understand whether I should use more traditional techniques like subqueries and groupings or more advanced ones like time_slice and windowed partitioning for what seems like a fairly straightforward use-case in the analytics of time-series data coming from our system.
Like a lot of people, we have bytes measured (ingress and egress) coming from lots of clusters and I'm trying to find the total bytes per time interval (normally mb per day) and peaks for each of these clusters.
I make monthly views out of our raw flex data that I'm working which have the following structure:
CREATE VIEW egress_data_2018_12 AS SELECT distinct(id), name, value, ts, cluster_id, metric_name, server_name, user_id, tenant FROM transferstats_2018 WHERE ts >= '2018-12-01 00:00:00' AND ts < '2019-01-01 00:00:00' AND metric_name = 'server_tenant_metrics_response_bytes';
(ingress is the same except a different metric_name and a different view name)
It's not very pretty, but I have a working select statement that leverages subqueries to get every days' total usage across a given cluster within the month:
SELECT x.cluster_id, x.tenant, round(sum(value) / 1024 / 1024, 2) as total_mb_egressed, ( SELECT round(sum(value) / 1024 / 1024, 2) from egress_data_2018_12 y where y.tenant = x.tenant AND ts >= '2018-12-01 00:00:00' AND ts < '2018-12-02 00:00:00' ) mb_12_01, ( SELECT round(sum(value) / 1024 / 1024, 2) from egress_data_2018_12 y where y.tenant = x.tenant AND ts >= '2018-12-02 00:00:00' AND ts < '2018-12-03 00:00:00' ) mb_12_02 -- etc... repeat for 29 more days FROM egress_data_2018_12 x WHERE x.ts >= '2018-12-01 00:00:00' AND x.ts < '2018-12-03 00:00:00' GROUP BY x.cluster_id, x.tenant, mb_12_01, mb_12_02 ORDER BY mb_12_01 desc;
Note that the reason this works is because all of the values for each server_name and user_id are getting summed together for a given tenant/cluster_id tuple, and that's exactly what we're looking for.
Now, while this works I'm sure it could be done more elegantly than either writing out or even generating all that SQL. But the real problem is I can't even figure out how to get the other statistic that I want, which is a peak for each day - basically finding the minute with the most bytes transferred for each cluster/tenant tuple (again, summed over all servers and users), and exposing that value at the hour, day, and/or monthly level.
From what I've read it sounds like timeslices might be able to help me, but it also sounds like this might be a good fit for traditional windowing - but I figured before heading down the wrong path I'd just ask here for some assistance if anyone's done something like this before.