SUM, AVERAGE, COUNT, STDEV across columns?

edited November 2022 in General Discussion

is there a concise way in Vertica to select metrics across a set of columns or expressions within a row, analogous to the way that GREATEST and LEAST work?

Goal

Specifically I'd like efficient reusable function to calculate a sum, average, count, variance across an arbitrary list of expressions for each row:

E.g.

SELECT 
    userid, 
    SUM(a1, a2, a3, a4, a5, a6) AS a_sum,
    COUNT(a1, a2, a3, a4, a5, a6) AS a_sum,
    AVG(a1, a2, a3, a4, a5, a6) AS a_avg,
    VARIANCE(a1, a2, a3, a4, a5, a6) AS a_variance,
FROM table

The best I can think of is to write this out atomically like in this suggestion. This is ok for sums, but gets very pedantic for higher things like average and variance esp when there can be null values.
```

Custom function

I suspect this does not exist in standard SQL or vSQL, and may be best implemented as a custom function.

Q1. Can these be implemented in as user-defined SQL functions? It would be nice to avoid C++. But I don't know how to iterate over a list of arguments

Q2. Is there a library of existing community-provided UDFs to search if this may already exist and/or contribute such function later? I see this GitHub repo: https://github.com/vertica/UDx-Examples/tree/master/Java-and-C++

https://stackoverflow.com/questions/15719557/how-to-compute-the-sum-of-multiple-columns-in-postgresql

Best Answers

Answers

  • moshegmosheg Vertica Employee Administrator
    edited November 2022

    Please consider the following example in pure Vertica SQL.

    CREATE TABLE my_table (userid int, a1 int, a2 int, a3 int, a4 int, a5 int, a6 int) order by userid;
    
    COPY my_table FROM STDIN DELIMITER ',' NULL AS 'NULL' ABORT ON ERROR;
    1,1,2,3,4,5,6,
    2,7,8,9,10,11,NULL,
    3,12,13,14,15,NULL,16,
    \.
    
    SELECt * FROM my_table ORDER BY 1;
    --  userid | a1 | a2 | a3 | a4 | a5 | a6
    -- --------+----+----+----+----+----+----
    --       1 |  1 |  2 |  3 |  4 |  5 |  6
    --       2 |  7 |  8 |  9 | 10 | 11 |
    --       3 | 12 | 13 | 14 | 15 |    | 16
    -- (3 rows)
    
    SELECT
           userid,
           APPLY_COUNT(ARRAY[a1,a2,a3,a4,a5,a6]) AS COUNT,
           APPLY_AVG(ARRAY[a1,a2,a3,a4,a5,a6]) AS AVG
    FROM my_table
    ORDER BY 1;
    --  userid | COUNT | AVG
    -- --------+-------+-----
    --       1 |     6 | 3.5
    --       2 |     5 |   9
    --       3 |     5 |  14
    -- (3 rows)
    
    SELECT EXPLODE(userid,ARRAY[a1,a2,a3,a4,a5,a6]) OVER(PARTITION BEST)
    FROM my_table
    ORDER BY 1;
    --  userid | position | value
    -- --------+----------+-------
    --       1 |        0 |     1
    --       1 |        1 |     2
    --       1 |        2 |     3
    --       1 |        3 |     4
    --       1 |        4 |     5
    --       1 |        5 |     6
    --       2 |        0 |     7
    --       2 |        1 |     8
    --       2 |        2 |     9
    --       2 |        3 |    10
    --       2 |        4 |    11
    --       2 |        5 |
    --       3 |        0 |    12
    --       3 |        1 |    13
    --       3 |        2 |    14
    --       3 |        3 |    15
    --       3 |        4 |
    --       3 |        5 |    16
    -- (18 rows)
    
    WITH
    list1 AS (SELECT EXPLODE(userid,ARRAY[a1,a2,a3,a4,a5,a6]) OVER(PARTITION BEST) FROM my_table)
    SELECT userid, SUM(value), COUNT(value), AVG(value), VARIANCE(value), STDDEV(value)
    FROM list1
    GROUP BY userid
    ORDER BY 1;
    --  userid | SUM | COUNT | AVG | VARIANCE |      STDDEV
    -- --------+-----+-------+-----+----------+------------------
    --       1 |  21 |     6 | 3.5 |      3.5 | 1.87082869338697
    --       2 |  45 |     5 |   9 |      2.5 | 1.58113883008419
    --       3 |  70 |     5 |  14 |      2.5 | 1.58113883008419
    -- (3 rows)
    
  • Aha, the Collection functions are exactly the concept I was seeking. Thank you!
    These work perfectly for this purpose.

    Do you know if there is an APPLY_VARIANCE or APPLY_STDDEV or equivalent?

  • @mosheg The function to iterate over arrays via EXPLODE(x) OVER (PARTITION BY ..) is amazing

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file