SUM, AVERAGE, COUNT, STDEV across columns?

Pieter_Sheth-Vo · November 2022

is there a concise way in Vertica to select metrics across a set of columns or expressions within a row, analogous to the way that GREATEST and LEAST work?

Goal

Specifically I'd like efficient reusable function to calculate a sum, average, count, variance across an arbitrary list of expressions for each row:

E.g.

SELECT 
    userid, 
    SUM(a1, a2, a3, a4, a5, a6) AS a_sum,
    COUNT(a1, a2, a3, a4, a5, a6) AS a_sum,
    AVG(a1, a2, a3, a4, a5, a6) AS a_avg,
    VARIANCE(a1, a2, a3, a4, a5, a6) AS a_variance,
FROM table

The best I can think of is to write this out atomically like in this suggestion. This is ok for sums, but gets very pedantic for higher things like average and variance esp when there can be null values.
```

Custom function

I suspect this does not exist in standard SQL or vSQL, and may be best implemented as a custom function.

Q1. Can these be implemented in as user-defined SQL functions? It would be nice to avoid C++. But I don't know how to iterate over a list of arguments

Q2. Is there a library of existing community-provided UDFs to search if this may already exist and/or contribute such function later? I see this GitHub repo: https://github.com/vertica/UDx-Examples/tree/master/Java-and-C++

https://stackoverflow.com/questions/15719557/how-to-compute-the-sum-of-multiple-columns-in-postgresql

Pieter_Sheth-Vo · November 2022

This is perfect. Thank you!

mosheg · November 2022

Q - Do you know if there is an APPLY_VARIANCE or APPLY_STDDEV or equivalent?
A - No. The existing Collection Functions are shown here
https://www.vertica.com/docs/12.0.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Array/ArrayFunctions.htm

For VARIANCE and STDDEV you can use the last query example or create a temporary table or create a stored procedure.
You can also open a support case, explain the business need and suggest it as a NFR.

mosheg · November 2022

Please consider the following example in pure Vertica SQL.

CREATE TABLE my_table (userid int, a1 int, a2 int, a3 int, a4 int, a5 int, a6 int) order by userid;

COPY my_table FROM STDIN DELIMITER ',' NULL AS 'NULL' ABORT ON ERROR;
1,1,2,3,4,5,6,
2,7,8,9,10,11,NULL,
3,12,13,14,15,NULL,16,
\.

SELECt * FROM my_table ORDER BY 1;
--  userid | a1 | a2 | a3 | a4 | a5 | a6
-- --------+----+----+----+----+----+----
--       1 |  1 |  2 |  3 |  4 |  5 |  6
--       2 |  7 |  8 |  9 | 10 | 11 |
--       3 | 12 | 13 | 14 | 15 |    | 16
-- (3 rows)

SELECT
       userid,
       APPLY_COUNT(ARRAY[a1,a2,a3,a4,a5,a6]) AS COUNT,
       APPLY_AVG(ARRAY[a1,a2,a3,a4,a5,a6]) AS AVG
FROM my_table
ORDER BY 1;
--  userid | COUNT | AVG
-- --------+-------+-----
--       1 |     6 | 3.5
--       2 |     5 |   9
--       3 |     5 |  14
-- (3 rows)

SELECT EXPLODE(userid,ARRAY[a1,a2,a3,a4,a5,a6]) OVER(PARTITION BEST)
FROM my_table
ORDER BY 1;
--  userid | position | value
-- --------+----------+-------
--       1 |        0 |     1
--       1 |        1 |     2
--       1 |        2 |     3
--       1 |        3 |     4
--       1 |        4 |     5
--       1 |        5 |     6
--       2 |        0 |     7
--       2 |        1 |     8
--       2 |        2 |     9
--       2 |        3 |    10
--       2 |        4 |    11
--       2 |        5 |
--       3 |        0 |    12
--       3 |        1 |    13
--       3 |        2 |    14
--       3 |        3 |    15
--       3 |        4 |
--       3 |        5 |    16
-- (18 rows)

WITH
list1 AS (SELECT EXPLODE(userid,ARRAY[a1,a2,a3,a4,a5,a6]) OVER(PARTITION BEST) FROM my_table)
SELECT userid, SUM(value), COUNT(value), AVG(value), VARIANCE(value), STDDEV(value)
FROM list1
GROUP BY userid
ORDER BY 1;
--  userid | SUM | COUNT | AVG | VARIANCE |      STDDEV
-- --------+-----+-------+-----+----------+------------------
--       1 |  21 |     6 | 3.5 |      3.5 | 1.87082869338697
--       2 |  45 |     5 |   9 |      2.5 | 1.58113883008419
--       3 |  70 |     5 |  14 |      2.5 | 1.58113883008419
-- (3 rows)

Pieter_Sheth-Vo · November 2022

Aha, the Collection functions are exactly the concept I was seeking. Thank you!
These work perfectly for this purpose.

Do you know if there is an APPLY_VARIANCE or APPLY_STDDEV or equivalent?

Pieter_Sheth-Vo · November 2022

@mosheg The function to iterate over arrays via EXPLODE(x) OVER (PARTITION BY ..) is amazing

We're Moving!

Create My New Community Account Now

SUM, AVERAGE, COUNT, STDEV across columns?

Goal

Custom function

Best Answers

Answers

Leave a Comment