SUM, AVERAGE, COUNT, STDEV across columns?
is there a concise way in Vertica to select metrics across a set of columns or expressions within a row, analogous to the way that GREATEST and LEAST work?
Goal
Specifically I'd like efficient reusable function to calculate a sum, average, count, variance across an arbitrary list of expressions for each row:
E.g.
SELECT userid, SUM(a1, a2, a3, a4, a5, a6) AS a_sum, COUNT(a1, a2, a3, a4, a5, a6) AS a_sum, AVG(a1, a2, a3, a4, a5, a6) AS a_avg, VARIANCE(a1, a2, a3, a4, a5, a6) AS a_variance, FROM table
The best I can think of is to write this out atomically like in this suggestion. This is ok for sums, but gets very pedantic for higher things like average and variance esp when there can be null values.
```
Custom function
I suspect this does not exist in standard SQL or vSQL, and may be best implemented as a custom function.
Q1. Can these be implemented in as user-defined SQL functions? It would be nice to avoid C++. But I don't know how to iterate over a list of arguments
Q2. Is there a library of existing community-provided UDFs to search if this may already exist and/or contribute such function later? I see this GitHub repo: https://github.com/vertica/UDx-Examples/tree/master/Java-and-C++
Best Answers
-
mosheg Vertica Employee Administrator
Q - Do you know if there is an APPLY_VARIANCE or APPLY_STDDEV or equivalent?
A - No. The existing Collection Functions are shown here
https://www.vertica.com/docs/12.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Array/ArrayFunctions.htmFor VARIANCE and STDDEV you can use the last query example or create a temporary table or create a stored procedure.
You can also open a support case, explain the business need and suggest it as a NFR.0
Answers
Please consider the following example in pure Vertica SQL.
Aha, the Collection functions are exactly the concept I was seeking. Thank you!
These work perfectly for this purpose.
Do you know if there is an APPLY_VARIANCE or APPLY_STDDEV or equivalent?
@mosheg The function to iterate over arrays via
EXPLODE(x) OVER (PARTITION BY ..)
is amazing