How to understand "union" and "group by" of "explain plan" ?
I'm comparing the performance between "left outer join" and "not exists" . I get the "explain plan" from "query_plan_profiles" .
But I don't understand some key words .
SQL:
insert into t3
select t1.*
from t1
where not exists (
select 1 from t2 where t1.bk = t2.bk
);
Explain Plan:
+-DML INSERT [Cost: 0, Rows: 0]
| Target Projection: v10poc.t3_b1 (RESEGMENT)
| Target Projection: v10poc.temp_msg_b1 (RESEGMENT)
| Target Projection: v10poc.temp_plan_exists_b1 (RESEGMENT)
| Target Projection: v10poc.t3_b0 (RESEGMENT)
| Target Projection: v10poc.temp_msg_b0 (RESEGMENT)
| Target Projection: v10poc.temp_plan_exists_b0 (RESEGMENT)
| Target Prep:
| Execute on: All Nodes
| Execute on: Query Initiator
| Execute on: All Nodes
| Execute on: Query Initiator
| Execute on: All Nodes
| +---> JOIN HASH [Semi] [Cost: 42K, Rows: 1 (NO STATISTICS)] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| +---> JOIN MERGEJOIN(inputs presorted) [Anti] [Cost: 14M, Rows: 11M] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| | Join Cond: (query_plan_profiles.transaction_id = VAL(2))
| | Join Cond: (t1.bk = VAL(1))
| | Execute on: All Nodes
| | +-- Outer -> SELECT [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 4)
| | +-- Outer -> STORAGE ACCESS for t1 [Cost: 549K, Rows: 23M] (PATH ID: 4)
| | | Execute on: All Nodes
| | | Projection: v10poc.t1_b0
| | | Materialize: t1.bk, t1.msg1, t1.msg2
| | | Execute on: All Nodes
| | | +---> STORAGE ACCESS for t2 [Cost: 389K, Rows: 23M] (PATH ID: 6)
| | | | +---> UNION [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 6)
| | | | Projection: v10poc.t2_b0
| | | | | Execute on: All Nodes
| | | | Materialize: t2.bk
| | | | | Execute on: All Nodes
| | | | Execute on: All Nodes
SQL:
insert into t4
select t11.*
from t11
left join t22
on (t11.bk = t22.bk)
where t22.msg1 is null
Explain Plan:
+-DML INSERT [Cost: 0, Rows: 0]
| Target Projection: v10poc.t4_b1 (RESEGMENT)
| Target Projection: v10poc.temp_msg_b1 (RESEGMENT)
| Target Projection: v10poc.temp_plan_left_b1 (RESEGMENT)
| Target Projection: v10poc.t4_b0 (RESEGMENT)
| Target Projection: v10poc.temp_msg_b0 (RESEGMENT)
| Target Projection: v10poc.temp_plan_left_b0 (RESEGMENT)
| Target Prep:
| Execute on: All Nodes
| Execute on: Query Initiator
| Execute on: All Nodes
| Execute on: Query Initiator
| Execute on: All Nodes
| +---> JOIN HASH [Semi] [Cost: 42K, Rows: 1 (NO STATISTICS)] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| +---> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 16M, Rows: 23M] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| | Join Cond: (query_plan_profiles.transaction_id = VAL(2))
| | Join Cond: (t11.bk = t22.bk)
| | Execute on: All Nodes
| | Filter: (t22.msg1 IS NULL)
| | Execute on: All Nodes
| | +-- Outer -> SELECT [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 4)
| | +-- Outer -> STORAGE ACCESS for t11 [Cost: 2M, Rows: 23M] (PATH ID: 4)
| | | Execute on: All Nodes
| | | Projection: v10poc.t11_b0
| | | Materialize: t11.bk, t11.msg1, t11.msg2
| | | Execute on: All Nodes
| | +-- Inner -> STORAGE ACCESS for t22 [Cost: 966K, Rows: 23M] (PATH ID: 5)
| | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 5)
| | | Projection: v10poc.t22_b0
"| | | | Group By:
""*SELECT* 1"".transaction_id, ""*SELECT* 1"".statement_id, ""*SELECT* 1"".path_id, ""*SELECT* 1"".path_line_index,
""*SELECT* 1"".path_is_started, ""*SELECT* 1"".path_is_completed, ""*SELECT* 1"".is_executing, ""*SELECT* 1"".running_time,
""*SELECT* 1"".memory_allocated_bytes, ""*SELECT* 1"".read_from_disk_bytes, ""*SELECT* 1"".received_bytes, ""*SELECT* 1"".sent_bytes, ""*SELECT* 1"".path_line"
| | | Materialize: t22.bk, t22.msg1
| | | | Execute on: All Nodes
| | | Execute on: All Nodes
So, could you explain what's the meaning of "union" and "group by" ?
But I don't understand some key words .
SQL:
insert into t3
select t1.*
from t1
where not exists (
select 1 from t2 where t1.bk = t2.bk
);
Explain Plan:
+-DML INSERT [Cost: 0, Rows: 0]
| Target Projection: v10poc.t3_b1 (RESEGMENT)
| Target Projection: v10poc.temp_msg_b1 (RESEGMENT)
| Target Projection: v10poc.temp_plan_exists_b1 (RESEGMENT)
| Target Projection: v10poc.t3_b0 (RESEGMENT)
| Target Projection: v10poc.temp_msg_b0 (RESEGMENT)
| Target Projection: v10poc.temp_plan_exists_b0 (RESEGMENT)
| Target Prep:
| Execute on: All Nodes
| Execute on: Query Initiator
| Execute on: All Nodes
| Execute on: Query Initiator
| Execute on: All Nodes
| +---> JOIN HASH [Semi] [Cost: 42K, Rows: 1 (NO STATISTICS)] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| +---> JOIN MERGEJOIN(inputs presorted) [Anti] [Cost: 14M, Rows: 11M] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| | Join Cond: (query_plan_profiles.transaction_id = VAL(2))
| | Join Cond: (t1.bk = VAL(1))
| | Execute on: All Nodes
| | +-- Outer -> SELECT [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 4)
| | +-- Outer -> STORAGE ACCESS for t1 [Cost: 549K, Rows: 23M] (PATH ID: 4)
| | | Execute on: All Nodes
| | | Projection: v10poc.t1_b0
| | | Materialize: t1.bk, t1.msg1, t1.msg2
| | | Execute on: All Nodes
| | | +---> STORAGE ACCESS for t2 [Cost: 389K, Rows: 23M] (PATH ID: 6)
| | | | +---> UNION [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 6)
| | | | Projection: v10poc.t2_b0
| | | | | Execute on: All Nodes
| | | | Materialize: t2.bk
| | | | | Execute on: All Nodes
| | | | Execute on: All Nodes
SQL:
insert into t4
select t11.*
from t11
left join t22
on (t11.bk = t22.bk)
where t22.msg1 is null
Explain Plan:
+-DML INSERT [Cost: 0, Rows: 0]
| Target Projection: v10poc.t4_b1 (RESEGMENT)
| Target Projection: v10poc.temp_msg_b1 (RESEGMENT)
| Target Projection: v10poc.temp_plan_left_b1 (RESEGMENT)
| Target Projection: v10poc.t4_b0 (RESEGMENT)
| Target Projection: v10poc.temp_msg_b0 (RESEGMENT)
| Target Projection: v10poc.temp_plan_left_b0 (RESEGMENT)
| Target Prep:
| Execute on: All Nodes
| Execute on: Query Initiator
| Execute on: All Nodes
| Execute on: Query Initiator
| Execute on: All Nodes
| +---> JOIN HASH [Semi] [Cost: 42K, Rows: 1 (NO STATISTICS)] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| +---> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 16M, Rows: 23M] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| | Join Cond: (query_plan_profiles.transaction_id = VAL(2))
| | Join Cond: (t11.bk = t22.bk)
| | Execute on: All Nodes
| | Filter: (t22.msg1 IS NULL)
| | Execute on: All Nodes
| | +-- Outer -> SELECT [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 4)
| | +-- Outer -> STORAGE ACCESS for t11 [Cost: 2M, Rows: 23M] (PATH ID: 4)
| | | Execute on: All Nodes
| | | Projection: v10poc.t11_b0
| | | Materialize: t11.bk, t11.msg1, t11.msg2
| | | Execute on: All Nodes
| | +-- Inner -> STORAGE ACCESS for t22 [Cost: 966K, Rows: 23M] (PATH ID: 5)
| | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 5)
| | | Projection: v10poc.t22_b0
"| | | | Group By:
""*SELECT* 1"".transaction_id, ""*SELECT* 1"".statement_id, ""*SELECT* 1"".path_id, ""*SELECT* 1"".path_line_index,
""*SELECT* 1"".path_is_started, ""*SELECT* 1"".path_is_completed, ""*SELECT* 1"".is_executing, ""*SELECT* 1"".running_time,
""*SELECT* 1"".memory_allocated_bytes, ""*SELECT* 1"".read_from_disk_bytes, ""*SELECT* 1"".received_bytes, ""*SELECT* 1"".sent_bytes, ""*SELECT* 1"".path_line"
| | | Materialize: t22.bk, t22.msg1
| | | | Execute on: All Nodes
| | | Execute on: All Nodes
So, could you explain what's the meaning of "union" and "group by" ?
0
Comments
Lets concentrate on the first query and explain plan. The path ID in the explain plan , mentions the sequence in which the operations are done in Vertica for your query, these sequence/operations are decided by the query planner in Vertica.
So Path ID 3 - states this statement It does a join and gets the resultset from t2
Path Id 4 - states this statement This statement gets the resultset from table t1
Path Id - 6 states this statement
UNION between resultset from t1 and resultset from t2
That is the reason you are seeing a UNION in explain plan.
Coming to second part of query
Vertica always has a group by clause when doing some aggregation, but here I cannot see any aggregation in your query
This might be because of the left join applied, can you try if the same occurs with inner join.
In both the cases, I am sure, the best performance will be given by NOT EXIST clause.
Some points to note seeing your explain plan
- Check Statistics and update statistics
- RESEGMENT in you plan states the data has to be re-segmented during query run time, which is time consuming, check your data distribution and the column on which it is distributed.
Hope this helpsHere is a feeble attempt to explain what you are seeing
- It appears you are looking for rows in t1 that do not exist in t2.
- the "insert" aspect of the query is not material to your question. so I will work with only the select portion of the query.
Following queries A,B,C are symantically equivalent.
(A and C are what you are using in your example, i show B as a possible optimization )
A. select t1.* from t1 where not exists ( select 1 from t2 where t1.bk = t2.bk );
B. select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.bk is null;
C. select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.msg1 is null; (this will have the same effect as B. more on this later but is sub-optimal to B )
However the way they are processed somewhat is different.
A. is processed using a Anti Join. Basically t1 is read from disk and values in t1 that do not match any value in t2 are written to output .
B. is Left outer join between t1 (outer) and t2 (inner) and subsequently a null filter is applied before the output is written to filter out values in t1 that do not exist in t2.
==== Explanation of B/C ====
Consider
skeswani=> select * from t1;
bk
----
1
1
2
2
(4 rows)
skeswani=> select * from t2;
bk | msg1 | whatever1
----+------+-----------
3 | 20 | 200
1 | 10 | 100
1 | 20 | 200
3 | 10 | 100
(4 rows)
skeswani=> select t1.*, t2.* from t1 left join t2 on (t1.bk = t2.bk);
bk | bk | msg1 | whatever1
----+----+------+-----------
1 | 1 | 10 | 100
1 | 1 | 10 | 100
1 | 1 | 20 | 200
1 | 1 | 20 | 200
2 | | |
2 | | |
--- As you can see, non-matching values of t2 (bk and msg1) null, both B and C will have the same output
skeswani=> select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.bk is null;
bk
----
2
2
(2 rows)
skeswani=> select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.msg1 is null;
bk
----
2
2
(2 rows)
skeswani=> explain select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.bk is null;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------
explain select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.bk is null;
Access Path:
+-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 408, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
| Join Cond: (t1.bk = t2.bk)
| Filter: (t2.bk IS NULL)
| Execute on: All Nodes
| +-- Outer -> STORAGE ACCESS for t1 [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| | Projection: public.t1_b0
| | Materialize: t1.bk
| | Execute on: All Nodes
| +-- Inner -> STORAGE ACCESS for t2 [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
| | Projection: public.t2_b0
| | Materialize: t2.bk
| | Execute on: All Nodes
skeswani=> explain select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.msg1 is null; QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------
explain select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.msg1 is null;
Access Path:
+-JOIN MERGEJOIN(inputs presorted) [RightOuter] [Cost: 648, Rows: 10K (NO STATISTICS)] (PATH ID: 1) Outer (RESEGMENT)
| Join Cond: (t1.bk = t2.bk)
| Filter: (t2.msg1 IS NULL)
| Execute on: All Nodes
| +-- Outer -> STORAGE ACCESS for t2 [Cost: 403, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
| | Projection: public.t2_b0
| | Materialize: t2.bk, t2.msg1
| | Execute on: All Nodes
| +-- Inner -> STORAGE ACCESS for t1 [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
| | Projection: public.t1_b0
| | Materialize: t1.bk
| | Execute on: All Nodes
*** Now to understanding why the Group By. ***
In order to do the LOJ (Left Outer Join), we do not need to the entire inner table t2.
We only need the distinct values of the join key (bk) and ouput col (msg1) in t2. The other cols can be replaced by 1.
(arugable only distinct values of bk are needed, hence B should be faster than C, and have the same results, since you dont need to materialize msg1 for correctness)
-- the equivalent rewrite would be the following. This omptimization reduces the size of the inner and number of comparisons required. The Group By you see is evaluating the distinct values by replacing other col by 1.
skeswani=> select t1.* from t1 left outer join ( select distinct bk, msg1 from t2) as t22 on (t1.bk = t22.bk) where t22.msg1 is null;
bk
----
2
2
(2 rows)
-- I believe a optimal rewrite would be the following where we materialize fewer col (hence faster)
skeswani=> select t1.* from t1 left outer join ( select distinct bk from t2) as t22 on (t1.bk = t22.bk) where t22.bk is null;
bk
----
2
2
(2 rows)
*** Now to understanding why the Union. ***
Unfortunately, i cannot figure out the why the union is needed in query A.
The plan you have posted appears truncated so its hard to be sure. I do not see the ineer part of the MergeJoin or the Hash Join
I suspect Union provides a similar optimization in the first plan, where it eliminates duplicates to the extent possible.
When we combine the inputs of the inner most join (select 1 from t2 where t1.bk = t2.bk) we only need the distinct values.
I hope this answers at least part of the question.
Thanks everybody .
Now I'm doubting my way to get explain plan .
Below is the SQL to produce the explain plan for "left outer join".
insert into temp_msg values( 'left' , sysdate()) ;
commit;
insert into t4
select t11.*
from t11
left join t22 on (t11.bk = t22.bk)
where t22.msg1 is null ;
insert into temp_msg values( 'left' , sysdate()) ;
insert into temp_plan_left
select * from query_plan_profiles
where transaction_id in (
select transaction_id from sessions where session_id = current_session()
) and statement_id = 1 ;
insert into profile_left
Select * from v_monitor.execution_engine_profiles
where transaction_id in (
select transaction_id from sessions where session_id = current_session()
) and statement_id = 1 ;
commit;
Thus, I got "group by" and "union" in my explain plan .
If I just explain the sql in VSQL like you , there's no "group by" and "union" in it . This is weird .
I don't know what's wrong .
dddd
Can you post your complete explain plan along with the SQL statment like this Also, Are you trying to see explain plan from other tool.
There can be three tools to understand explain plan with Vertica
1. vsql
2. Management console
3. Graphviz
Hope this helps.
This is the point I don't understand now .
Thanks.