Speeding up the Median Function

Hi,

I am trying to use the MEDIAN function to calculate median of a column in a Vertica table The following query takes around 75 seconds on a table of 6+ million rows and takes more than 480 seconds (8 minutes) on a view with the same number of rows created by merging two tables.

I have given the query below:

SELECT count(*) as Frequency,       
               MAX(time) as MaxTime,
               AVG(time) as AverageTime,
               MAX(MedianCal) as MedianTime
 FROM (SELECT time,MEDIAN(time) OVER () as MedianCal FROM table1) AS T1

 Is there a way to speed up the median calculation?

Thank you.

Ravi

Comments

  • There's no PARTITION expression on the analytic. So, it's doing a full table analysis in order to calculate MEDIAN(time). That could be expensive, depending on how big this table is.

    What projection is the EXPLAIN plan referencing, and how is it ordered? That could make a big difference, too.  Are there any NULL values in time?

  • Prasanta_PalPrasanta_Pal - Select Field - Employee
    As Curtis mentioned that you did not put any expression in OVER clause, so it is doing full table analysis, if you have some clause like OVER(PARTITION by ...) will calculate MEDIAN based on the PARTITION clause.

    You said "takes more than 480 seconds (8 minutes) on a view with the same number of rows created by merging two tables" - what is the view definition you created?
    You may compare the EXPLAIN plan between the two queries.

    The query you pasted,  what is output of the query and EXPLAIN plan?

  • Hi Curtis and Prasanta,
    Thank you so much for your replies. I intended to calculate the Median for the entire table and that's why I didn't provide any PARTITION BY in the OVER clause.

    There are no NULL values in the time column.

    The run times I have quoted in the first post seems bloated since the subsequent runs are taking much less time. I have given the updated times below:

    I have provided the query used to create the view below:

    CREATE VIEW view1 as
    SELECT a.id, a.count,a.date,...
    FROM tableA a 
    JOIN
    tableB b
    ON (a.id=b.id) AND (a.count=b.count) AND (a.date=b.date)
    LEFT JOIN
    tableC c
    ON (a.id=c.id) AND (a.count=c.count) AND (a.date=c.date);

    Partition has been done on variable dt. tableA and tableC have been ordered by id, count,date whereas tableB has been ordered by id,count,count_number,dt.

    tableA has 6435486 rows (6.43 million),tableB has 20665507 rows (~20 million) and tableC has 6358059 (6.35 million). The generated view contains 20665478 rows (~20 million). 

    The median(time) calculation on tableA takes around 6 seconds. tableB doesn't contain time variable but median run on an equivalent variable takes around 20 seconds.

    However, median(time) on the generated view takes around 235 seconds.

    The EXPLAIN PLAN for the view is given below:

    QUERY PLAN
    ------------------------------
    QUERY PLAN DESCRIPTION:
    ------------------------------

    EXPLAIN SELECT COUNT(*) FROM view1

    Access Path:
    +-GROUPBY NOTHING [Cost: 4M, Rows: 1 (NO STATISTICS)] (PATH ID: 1)
    |  Aggregates: count(*)
    | +---> JOIN HASH [RightOuter] [Cost: 4M, Rows: 133K (NO STATISTICS)] (PATH ID: 3)
    | |      Join Cond: (a.id = c.id) AND (a.count = c.count) AND (a.date = c.date)
    | |      Materialize at Input: b.id, b.count, b.date
    | | +-- Outer -> STORAGE ACCESS for c [Cost: 90K, Rows: 6M (NO STATISTICS)] (PATH ID: 4)
    | | |      Projection: tableC_super
    | | |      Materialize: c.id, c.count, c.date
    | | |      Runtime Filters: (SIP1(HashJoin): c.id), (SIP2(HashJoin): c.count), (SIP3(HashJoin): c.date), (SIP4(HashJoin): c.id, c.count, c.date)
    | | +-- Inner -> JOIN HASH [Cost: 4M, Rows: 133K (NO STATISTICS)] (PATH ID: 5)
    | | |      Join Cond: (a.id = b.id) AND (a.count = b.count) AND (a.date = b.date)
    | | | +-- Outer -> STORAGE ACCESS for b [Cost: 260K, Rows: 21M (NO STATISTICS)] (PATH ID: 6)
    | | | |      Projection: tableB_super
    | | | |      Materialize: b.id, b.count, b.date
    | | | |      Runtime Filters: (SIP5(HashJoin): b.id), (SIP6(HashJoin): b.count), (SIP7(HashJoin): b.date), (SIP8(HashJoin): b.id, b.count, b.date)
    | | | +-- Inner -> STORAGE ACCESS for a [Cost: 91K, Rows: 6M (NO STATISTICS)] (PATH ID: 7)
    | | | |      Projection: tableA_super
    | | | |      Materialize: a.id, a.count, a.date


    ------------------------------
    -----------------------------------------------
    PLAN: BASE QUERY PLAN (GraphViz Format)
    -----------------------------------------------
    digraph G {
    graph [rankdir=BT, label = "BASE QUERY PLAN\nQuery: EXPLAIN SELECT COUNT(*) FROM \"view1\"\n\nAll Nodes Vector: \n\n  node[0]=v_test_node0001 (initiator) Up\n", labelloc=t, labeljust=l ordering=out]
    0[label = "Root \nOutBlk=[UncTuple]", color = "green", shape = "house"];
    1[label = "NewEENode \nOutBlk=[UncTuple]", color = "green", shape = "box"];
    2[label = "GroupByPipe: 0 keys\nAggs:\n  count(*)\nUnc: Integer(8)", color = "green", shape = "box"];
    3[label = "ExprEval: \n  1\nUnc: Integer(8)", color = "green", shape = "box"];
    4[label = "ExprEval: \n  a.id\nUnc: Varchar(200)", color = "green", shape = "box"];
    5[label = "Join: Hash-Join: \n(tableC x tableA) using tableC_super and previous join (PATH ID: 3)\n[RightOuter]\n\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "green", shape = "box"];
    6[label = "StorageUnionStep: tableC_super\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "purple", shape = "box"];
    7[label = "ScanStep: tableC_super\nSIP1(HashJoin): c.id\nSIP2(HashJoin): c.count\nSIP3(HashJoin): c.date\nSIP4(HashJoin): c.id, c.count, c.date\nid\ncount\ndate\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "brown", shape = "box"];
    8[label = "StorageUnionStep: tableB_super\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "purple", shape = "box"];
    9[label = "Join: Hash-Join: \n(tableB x tableA) using tableB_super and tableA_super (PATH ID: 5)\n\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "brown", shape = "box"];
    10[label = "ScanStep: tableB_super\nSIP5(HashJoin): b.id\nSIP6(HashJoin): b.count\nSIP7(HashJoin): b.date\nSIP8(HashJoin): b.id, b.count, b.date\nid\ncount\ndate\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "brown", shape = "box"];
    11[label = "StorageUnionStep: tableA_super\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "purple", shape = "box"];
    12[label = "ScanStep: tableA_super\nid\ncount\ndate\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "brown", shape = "box"];
    1->0 [label = "V[0]",color = "black"];
    2->1 [label = "0",color = "blue"];
    3->2 [label = "0",color = "blue"];
    4->3 [label = "0",color = "blue"];
    5->4 [label = "0",color = "blue"];
    6->5 [label = "0",color = "blue"];
    7->6 [label = "0",color = "blue"];
    8->5 [label = "1",color = "blue"];
    9->8 [label = "0",color = "blue"];
    10->9 [label = "0",color = "blue"];
    11->9 [label = "1",color = "blue"];
    12->11 [label = "0",color = "blue"];
    }

    Could you please let me know if there is a way to optimize the view generation query and speed up the median calculation on the view generated?

    Thank you.

    Ravi

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file