Speeding up the Median Function
Hi,
I am trying to use the MEDIAN function to calculate median of a column in a Vertica table The following query takes around 75 seconds on a table of 6+ million rows and takes more than 480 seconds (8 minutes) on a view with the same number of rows created by merging two tables.
I have given the query below:
SELECT count(*) as Frequency,
MAX(time) as MaxTime,
AVG(time) as AverageTime,
MAX(MedianCal) as MedianTime
FROM (SELECT time,MEDIAN(time) OVER () as MedianCal FROM table1) AS T1
Is there a way to speed up the median calculation?
Thank you.
Ravi
I am trying to use the MEDIAN function to calculate median of a column in a Vertica table The following query takes around 75 seconds on a table of 6+ million rows and takes more than 480 seconds (8 minutes) on a view with the same number of rows created by merging two tables.
I have given the query below:
SELECT count(*) as Frequency,
MAX(time) as MaxTime,
AVG(time) as AverageTime,
MAX(MedianCal) as MedianTime
FROM (SELECT time,MEDIAN(time) OVER () as MedianCal FROM table1) AS T1
Is there a way to speed up the median calculation?
Thank you.
Ravi
0
Comments
There's no PARTITION expression on the analytic. So, it's doing a full table analysis in order to calculate MEDIAN(time). That could be expensive, depending on how big this table is.
What projection is the EXPLAIN plan referencing, and how is it ordered? That could make a big difference, too. Are there any NULL values in time?
You said "takes more than 480 seconds (8 minutes) on a view with the same number of rows created by merging two tables" - what is the view definition you created?
You may compare the EXPLAIN plan between the two queries.
The query you pasted, what is output of the query and EXPLAIN plan?
Thank you so much for your replies. I intended to calculate the Median for the entire table and that's why I didn't provide any PARTITION BY in the OVER clause.
There are no NULL values in the time column.
The run times I have quoted in the first post seems bloated since the subsequent runs are taking much less time. I have given the updated times below:
I have provided the query used to create the view below:
CREATE VIEW view1 as
SELECT a.id, a.count,a.date,...
FROM tableA a
JOIN
tableB b
ON (a.id=b.id) AND (a.count=b.count) AND (a.date=b.date)
LEFT JOIN
tableC c
ON (a.id=c.id) AND (a.count=c.count) AND (a.date=c.date);
Partition has been done on variable dt. tableA and tableC have been ordered by id, count,date whereas tableB has been ordered by id,count,count_number,dt.
tableA has 6435486 rows (6.43 million),tableB has 20665507 rows (~20 million) and tableC has 6358059 (6.35 million). The generated view contains 20665478 rows (~20 million).
The median(time) calculation on tableA takes around 6 seconds. tableB doesn't contain time variable but median run on an equivalent variable takes around 20 seconds.
However, median(time) on the generated view takes around 235 seconds.
The EXPLAIN PLAN for the view is given below:
QUERY PLAN
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------
EXPLAIN SELECT COUNT(*) FROM view1
Access Path:
+-GROUPBY NOTHING [Cost: 4M, Rows: 1 (NO STATISTICS)] (PATH ID: 1)
| Aggregates: count(*)
| +---> JOIN HASH [RightOuter] [Cost: 4M, Rows: 133K (NO STATISTICS)] (PATH ID: 3)
| | Join Cond: (a.id = c.id) AND (a.count = c.count) AND (a.date = c.date)
| | Materialize at Input: b.id, b.count, b.date
| | +-- Outer -> STORAGE ACCESS for c [Cost: 90K, Rows: 6M (NO STATISTICS)] (PATH ID: 4)
| | | Projection: tableC_super
| | | Materialize: c.id, c.count, c.date
| | | Runtime Filters: (SIP1(HashJoin): c.id), (SIP2(HashJoin): c.count), (SIP3(HashJoin): c.date), (SIP4(HashJoin): c.id, c.count, c.date)
| | +-- Inner -> JOIN HASH [Cost: 4M, Rows: 133K (NO STATISTICS)] (PATH ID: 5)
| | | Join Cond: (a.id = b.id) AND (a.count = b.count) AND (a.date = b.date)
| | | +-- Outer -> STORAGE ACCESS for b [Cost: 260K, Rows: 21M (NO STATISTICS)] (PATH ID: 6)
| | | | Projection: tableB_super
| | | | Materialize: b.id, b.count, b.date
| | | | Runtime Filters: (SIP5(HashJoin): b.id), (SIP6(HashJoin): b.count), (SIP7(HashJoin): b.date), (SIP8(HashJoin): b.id, b.count, b.date)
| | | +-- Inner -> STORAGE ACCESS for a [Cost: 91K, Rows: 6M (NO STATISTICS)] (PATH ID: 7)
| | | | Projection: tableA_super
| | | | Materialize: a.id, a.count, a.date
------------------------------
-----------------------------------------------
PLAN: BASE QUERY PLAN (GraphViz Format)
-----------------------------------------------
digraph G {
graph [rankdir=BT, label = "BASE QUERY PLAN\nQuery: EXPLAIN SELECT COUNT(*) FROM \"view1\"\n\nAll Nodes Vector: \n\n node[0]=v_test_node0001 (initiator) Up\n", labelloc=t, labeljust=l ordering=out]
0[label = "Root \nOutBlk=[UncTuple]", color = "green", shape = "house"];
1[label = "NewEENode \nOutBlk=[UncTuple]", color = "green", shape = "box"];
2[label = "GroupByPipe: 0 keys\nAggs:\n count(*)\nUnc: Integer(8)", color = "green", shape = "box"];
3[label = "ExprEval: \n 1\nUnc: Integer(8)", color = "green", shape = "box"];
4[label = "ExprEval: \n a.id\nUnc: Varchar(200)", color = "green", shape = "box"];
5[label = "Join: Hash-Join: \n(tableC x tableA) using tableC_super and previous join (PATH ID: 3)\n[RightOuter]\n\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "green", shape = "box"];
6[label = "StorageUnionStep: tableC_super\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "purple", shape = "box"];
7[label = "ScanStep: tableC_super\nSIP1(HashJoin): c.id\nSIP2(HashJoin): c.count\nSIP3(HashJoin): c.date\nSIP4(HashJoin): c.id, c.count, c.date\nid\ncount\ndate\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "brown", shape = "box"];
8[label = "StorageUnionStep: tableB_super\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "purple", shape = "box"];
9[label = "Join: Hash-Join: \n(tableB x tableA) using tableB_super and tableA_super (PATH ID: 5)\n\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "brown", shape = "box"];
10[label = "ScanStep: tableB_super\nSIP5(HashJoin): b.id\nSIP6(HashJoin): b.count\nSIP7(HashJoin): b.date\nSIP8(HashJoin): b.id, b.count, b.date\nid\ncount\ndate\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "brown", shape = "box"];
11[label = "StorageUnionStep: tableA_super\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "purple", shape = "box"];
12[label = "ScanStep: tableA_super\nid\ncount\ndate\nUnc: Varchar(200)\nUnc: Integer(8)\nUnc: Varchar(25)", color = "brown", shape = "box"];
1->0 [label = "V[0]",color = "black"];
2->1 [label = "0",color = "blue"];
3->2 [label = "0",color = "blue"];
4->3 [label = "0",color = "blue"];
5->4 [label = "0",color = "blue"];
6->5 [label = "0",color = "blue"];
7->6 [label = "0",color = "blue"];
8->5 [label = "1",color = "blue"];
9->8 [label = "0",color = "blue"];
10->9 [label = "0",color = "blue"];
11->9 [label = "1",color = "blue"];
12->11 [label = "0",color = "blue"];
}
Could you please let me know if there is a way to optimize the view generation query and speed up the median calculation on the view generated?
Thank you.
Ravi