How to understand "union" and "group by" of "explain plan" ?

I'm comparing the performance between "left outer join" and "not exists" .  I get the "explain plan" from "query_plan_profiles" .
But I don't understand some key words .
SQL:
insert into t3
select t1.*
from t1
where not exists (
  select 1 from t2 where t1.bk = t2.bk
);
Explain Plan:
+-DML INSERT [Cost: 0, Rows: 0]
|  Target Projection: v10poc.t3_b1 (RESEGMENT)
|  Target Projection: v10poc.temp_msg_b1 (RESEGMENT)
|  Target Projection: v10poc.temp_plan_exists_b1 (RESEGMENT)
|  Target Projection: v10poc.t3_b0 (RESEGMENT)
|  Target Projection: v10poc.temp_msg_b0 (RESEGMENT)
|  Target Projection: v10poc.temp_plan_exists_b0 (RESEGMENT)
|  Target Prep:
|  Execute on: All Nodes
|  Execute on: Query Initiator
|  Execute on: All Nodes
|  Execute on: Query Initiator
|  Execute on: All Nodes
| +---> JOIN HASH [Semi] [Cost: 42K, Rows: 1 (NO STATISTICS)] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| +---> JOIN MERGEJOIN(inputs presorted) [Anti] [Cost: 14M, Rows: 11M] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| |      Join Cond: (query_plan_profiles.transaction_id = VAL(2))
| |      Join Cond: (t1.bk = VAL(1))
| |      Execute on: All Nodes
| | +-- Outer -> SELECT [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 4)
| | +-- Outer -> STORAGE ACCESS for t1 [Cost: 549K, Rows: 23M] (PATH ID: 4)
| | |      Execute on: All Nodes
| | |      Projection: v10poc.t1_b0
| | |      Materialize: t1.bk, t1.msg1, t1.msg2
| | |      Execute on: All Nodes
| | | +---> STORAGE ACCESS for t2 [Cost: 389K, Rows: 23M] (PATH ID: 6)
| | | | +---> UNION [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 6)
| | | |      Projection: v10poc.t2_b0
| | | | |      Execute on: All Nodes
| | | |      Materialize: t2.bk
| | | | |      Execute on: All Nodes
| | | |      Execute on: All Nodes

SQL:
insert into t4
select t11.*
from t11
left join t22
 on (t11.bk = t22.bk)
where t22.msg1 is null
Explain Plan:
+-DML INSERT [Cost: 0, Rows: 0]
|  Target Projection: v10poc.t4_b1 (RESEGMENT)
|  Target Projection: v10poc.temp_msg_b1 (RESEGMENT)
|  Target Projection: v10poc.temp_plan_left_b1 (RESEGMENT)
|  Target Projection: v10poc.t4_b0 (RESEGMENT)
|  Target Projection: v10poc.temp_msg_b0 (RESEGMENT)
|  Target Projection: v10poc.temp_plan_left_b0 (RESEGMENT)
|  Target Prep:
|  Execute on: All Nodes
|  Execute on: Query Initiator
|  Execute on: All Nodes
|  Execute on: Query Initiator
|  Execute on: All Nodes
| +---> JOIN HASH [Semi] [Cost: 42K, Rows: 1 (NO STATISTICS)] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| +---> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 16M, Rows: 23M] (PATH ID: 3) Outer (RESEGMENT) Inner (RESEGMENT)
| |      Join Cond: (query_plan_profiles.transaction_id = VAL(2))
| |      Join Cond: (t11.bk = t22.bk)
| |      Execute on: All Nodes
| |      Filter: (t22.msg1 IS NULL)
| |      Execute on: All Nodes
| | +-- Outer -> SELECT [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 4)
| | +-- Outer -> STORAGE ACCESS for t11 [Cost: 2M, Rows: 23M] (PATH ID: 4)
| | |      Execute on: All Nodes
| | |      Projection: v10poc.t11_b0
| | |      Materialize: t11.bk, t11.msg1, t11.msg2
| | |      Execute on: All Nodes
| | +-- Inner -> STORAGE ACCESS for t22 [Cost: 966K, Rows: 23M] (PATH ID: 5)
| | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 41K, Rows: 2 (NO STATISTICS)] (PATH ID: 5)
| | |      Projection: v10poc.t22_b0
"| | | |      Group By:
""*SELECT* 1"".transaction_id, ""*SELECT* 1"".statement_id, ""*SELECT* 1"".path_id, ""*SELECT* 1"".path_line_index,
""*SELECT* 1"".path_is_started, ""*SELECT* 1"".path_is_completed, ""*SELECT* 1"".is_executing, ""*SELECT* 1"".running_time,
""*SELECT* 1"".memory_allocated_bytes, ""*SELECT* 1"".read_from_disk_bytes, ""*SELECT* 1"".received_bytes, ""*SELECT* 1"".sent_bytes, ""*SELECT* 1"".path_line"
| | |      Materialize: t22.bk, t22.msg1
| | | |      Execute on: All Nodes
| | |      Execute on: All Nodes

So, could you explain what's the meaning of "union" and "group by"  ?

Comments

  • Navin_CNavin_C Vertica Customer
    Hi Jason,

    Lets concentrate on the first query and explain plan.
    select t1.* 
    from t1 
    where not exists (
      select 1 from t2 where t1.bk = t2.bk
    );
    The path ID in the explain plan , mentions the sequence in which the operations are done in Vertica for your query, these sequence/operations are decided by the query planner in Vertica.

    So Path ID 3 - states this statement 
     select 1 from t2 where t1.bk = t2.bk
    It does a join and gets the resultset from t2

    Path Id 4 - states this statement
    select t1.* 
    from t1
    This statement gets the resultset from table t1

    Path Id - 6 states this statement

    UNION between resultset from t1 and resultset from t2

    That is the reason you are seeing a UNION in explain plan.

    Coming to second part of query

    Vertica always has  a group by clause when doing some aggregation, but here I cannot see any aggregation in your query

    This might be because of the left join applied, can you try if the same occurs with inner join.

    In both the cases, I am sure, the best performance will be given by NOT EXIST clause.
    Some points to note seeing your explain plan
    1. Check Statistics and update statistics
    2. RESEGMENT in you plan states the data has to be re-segmented during query run time, which is time consuming, check your data distribution and the column on which it is distributed.
    Hope this helps

  • Here is a feeble attempt to explain what you are seeing


    - It appears you are looking for rows in t1 that do not exist in t2.
    - the "insert" aspect of the query is not material to your question. so I will work with only the select portion of the query.


    Following queries A,B,C are symantically equivalent. 
    (A and C are what you are using in your example, i show B as a possible optimization )


    A. select t1.* from t1 where not exists ( select 1 from t2 where t1.bk = t2.bk );
    B. select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.bk is null;
    C. select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.msg1 is null; (this will have the same effect as B. more on this later but is sub-optimal to B )

    However the way they are processed somewhat is different.

    A. is processed using a Anti Join. Basically t1 is read from disk and values in t1 that do not match any value in t2 are written to output .

    B. is Left outer join between t1 (outer) and t2 (inner) and subsequently a null filter is applied before the output is written to filter out values in t1 that do not exist in t2.



    ==== Explanation of B/C ====


    Consider

    skeswani=> select * from t1;
     bk 
    ----
      1
      1
      2
      2
    (4 rows)

    skeswani=> select * from t2;
     bk | msg1 | whatever1 
    ----+------+-----------
      3 |   20 |       200
      1 |   10 |       100
      1 |   20 |       200
      3 |   10 |       100
    (4 rows)



    skeswani=> select t1.*, t2.* from t1 left join t2 on (t1.bk = t2.bk);
     bk | bk | msg1 | whatever1 
    ----+----+------+-----------
      1 |  1 |   10 |       100
      1 |  1 |   10 |       100
      1 |  1 |   20 |       200
      1 |  1 |   20 |       200
      2 |    |      |          
      2 |    |      |          

    --- As you can see, non-matching values of t2 (bk and msg1)  null, both B and C will have the same output

    skeswani=> select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.bk is null;
     bk 
    ----
      2
      2
    (2 rows)

    skeswani=> select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.msg1 is null;
     bk 
    ----
      2
      2
    (2 rows)


    skeswani=> explain select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.bk is null;
                                                                                                                                                     QUERY PLAN                                                                                                                                                 
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     ------------------------------ 
     QUERY PLAN DESCRIPTION: 
     ------------------------------
     
     explain select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.bk is null;
     
     Access Path:
     +-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 408, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
     |  Join Cond: (t1.bk = t2.bk)
     |  Filter: (t2.bk IS NULL)
     |  Execute on: All Nodes
     | +-- Outer -> STORAGE ACCESS for t1 [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
     | |      Projection: public.t1_b0
     | |      Materialize: t1.bk
     | |      Execute on: All Nodes
     | +-- Inner -> STORAGE ACCESS for t2 [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
     | |      Projection: public.t2_b0
     | |      Materialize: t2.bk
     | |      Execute on: All Nodes
     


    skeswani=> explain select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.msg1 is null;                                                                                                                                                  QUERY PLAN          
                                                                                                                                            
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------------------------
     ------------------------------ 
     QUERY PLAN DESCRIPTION: 
     ------------------------------
     
     explain select t1.* from t1 left join t2 on (t1.bk = t2.bk) where t2.msg1 is null;
     
     Access Path:
     +-JOIN MERGEJOIN(inputs presorted) [RightOuter] [Cost: 648, Rows: 10K (NO STATISTICS)] (PATH ID: 1) Outer (RESEGMENT)
     |  Join Cond: (t1.bk = t2.bk)
     |  Filter: (t2.msg1 IS NULL)
     |  Execute on: All Nodes
     | +-- Outer -> STORAGE ACCESS for t2 [Cost: 403, Rows: 10K (NO STATISTICS)] (PATH ID: 2)
     | |      Projection: public.t2_b0
     | |      Materialize: t2.bk, t2.msg1
     | |      Execute on: All Nodes
     | +-- Inner -> STORAGE ACCESS for t1 [Cost: 202, Rows: 10K (NO STATISTICS)] (PATH ID: 3)
     | |      Projection: public.t1_b0
     | |      Materialize: t1.bk
     | |      Execute on: All Nodes
     



    *** Now to understanding why the Group By. ***

    In order to do the LOJ (Left Outer Join), we do not need to the entire inner table t2. 
    We only need the distinct values of the join key (bk) and ouput col (msg1) in t2. The other cols can be replaced by 1. 
    (arugable only distinct values of bk are needed, hence B should be faster than C, and have the same results, since you dont need to materialize msg1 for correctness)


    -- the equivalent rewrite would be the following. This omptimization reduces the size of the inner and number of comparisons required. The Group By you see is evaluating the distinct values by replacing other col by 1.

    skeswani=> select t1.* from t1 left outer join ( select distinct bk, msg1 from t2) as t22 on (t1.bk = t22.bk) where t22.msg1 is null;
     bk 
    ----
      2
      2
    (2 rows)


    -- I believe a optimal rewrite would be the following where we materialize fewer col (hence faster)
    skeswani=> select t1.* from t1 left outer join ( select distinct bk from t2) as t22 on (t1.bk = t22.bk) where t22.bk is null;
     bk 
    ----
      2
      2
    (2 rows)



    *** Now to understanding why the Union. ***

    Unfortunately, i cannot figure out the why the union is needed in query A. 
    The plan you have posted appears truncated so its hard to be sure. I do not see the ineer part of the MergeJoin or the Hash Join
    I suspect Union provides a similar optimization in the first plan, where it eliminates duplicates to the extent possible.
     
    When we combine the inputs of the inner most join (select 1 from t2 where t1.bk = t2.bk) we only need the distinct values.



    I hope this answers at least part of the question.


     
  • Thanks everybody .
    Now I'm doubting my way to get explain plan .
    Below is the SQL to produce the explain plan for "left outer join".


    insert into temp_msg values( 'left' , sysdate()) ;
    commit;
    insert into t4
    select t11.*
    from t11
    left join t22 on (t11.bk = t22.bk)
    where t22.msg1 is null ;

    insert into temp_msg values( 'left' , sysdate()) ;

    insert into temp_plan_left
    select * from query_plan_profiles
    where transaction_id in (
     select transaction_id from sessions where session_id = current_session()
    ) and statement_id = 1 ;

    insert into  profile_left
    Select * from v_monitor.execution_engine_profiles
    where transaction_id in (
     select transaction_id from sessions where session_id = current_session()
    ) and statement_id = 1 ;
    commit;

     

    Thus, I got "group by" and "union" in my explain plan .

    If I just explain the sql in VSQL like you , there's no "group by" and "union" in it . This is weird .

    I don't know what's wrong .

  • test sql format 
  • ccccdddd
    dddd
    dddd
    sfae
  • Navin_CNavin_C Vertica Customer
    Hi Jason,

    Can you post your complete explain plan along with the SQL statment like this
    nnani=> explain select sysdate() from dual;                                                                                                                                         QUERY PLAN          
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     ------------------------------
     QUERY PLAN DESCRIPTION:
     ------------------------------
     explain select sysdate() from dual;
     Access Path:
     +-STORAGE ACCESS for dual [Cost: 663, Rows: 10K (NO STATISTICS)] (PATH ID: 1)
     |  Projection: v_catalog.dual_p
     |  Materialize: dual.dummy
     |  Execute on: Query Initiator
    Also, Are you trying to see explain plan from other tool.
    There can be three tools to understand explain plan with Vertica
    1. vsql
    2. Management console
    3. Graphviz

    Hope this helps.
  • Below is the explain plan from vsql  .

    dbadmin=> explain select t1.* from t1 where not exists ( select 1 from t2 where t1.bk = t2.bk );
      QUERY PLAN                                              
     ------------------------------
     QUERY PLAN DESCRIPTION:
     ------------------------------
     
     explain select t1.* from t1 where not exists ( select 1 from t2 where t1.bk = t2.bk );
     
     Access Path:
     +-JOIN MERGEJOIN(inputs presorted) [Anti] [Cost: 13M, Rows: 11M] (PATH ID: 1) Outer (RESEGMENT) Inner (RESEGMENT)
     |  Join Cond: (t1.bk = VAL(1))
     |  Execute on: All Nodes
     | +-- Outer -> STORAGE ACCESS for t1 [Cost: 457K, Rows: 23M] (PATH ID: 2)
     | |      Projection: v10poc.t1_b0
     | |      Materialize: t1.bk, t1.msg1, t1.msg2
     | |      Execute on: All Nodes
     | +-- Inner -> SELECT [Cost: 383K, Rows: 23M] (PATH ID: 3)
     | |      Execute on: All Nodes
     | | +---> STORAGE ACCESS for t2 [Cost: 383K, Rows: 23M] (PATH ID: 4)
     | | |      Projection: v10poc.t2_b0
     | | |      Materialize: t2.bk
     | | |      Execute on: All Nodes
     

    Before this, I tried to get explain plan from "query_plan_profiles". But that's different with the above .
    This is the point I don't understand now .

    Thanks.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file