Query event 'Consider a cosegmented, cosorted join' ?

For one of may query, query_event table has below suggestion.  

"Ensure that the plan is reasonable.  Consider a cosegmented, cosorted join.  Increase memory available to the plan."

cosorted join:
I assume it means both the projections should have join keys as sort keys.  
But does the order of sort keys matter ?
Also can one of the projections have more sort key then the other, assuming I want merge join ?

cosegmented:

Does this mean  they have to be segmented on the same columns ?

Strategy I am following for segmentation is to segment them by more a common key(which exists in all tables) and then by tables specific key. Is this correct approach ?


Thanks
-Siva


Comments

  • Dear Siva  -

    When you join two big tables , say, sales_fact and customer_dim - and especially if the join between those tables occurs in an important part of all queries -

    then,

    * you want to sort them by the join column (which makes them co-sorted). Here, you can, if you like, add other columns to the sort order, after the join column.

    * you want to segment them by the join column - and only by the join column, like in:
    segmented by HASH(customer_key).
    This assures that all rows of sales_fact with a customer_key of 42 will reside on the same node as the row of customer_dim with customer_key of 42. And this ensures that, to make the join between the two tables, no re-segmentation is necessary in the query plan, this means: no rows need to travel from one node to another -> this is a co-located join.
    But this only works if you segment the two tables by the HASH() of the same column - or column combination: HASH(42) and HASH(42, date '2015-02-15') will produce different results, and the concerned row will go to different nodes.

    You will see that you've done a good job when, in the output of your EXPLAIN SELECT .. , in the join of the two biggest tables, you will see: MERGE_JOIN (the fastest possible join, thanks to co-sorted join tables), and you won't see RESEGMENT (showing that you have a co-located join).

    Obviously, remember to keep small tables UNSEGMENTED ALL NODES, and only to segment big tables.

    Happy playing , and good luck -

    marco

  • Hi!

    Very nice answer of Marco Gessner.

    I just will add my 2 cents about EXPLAIN <query>: in the output of EXPLAIN VERBOSE <query> you don't wanna see a "network cost", only in a last path(a reduce job).
  • Thanks for the detailed answer. 

    Should the segmented columns be unique ? I have a scenario where relation between 2 tables is Many to One.  Many side is central table.

    If I include only join columns in the segment then many side will have frequent repetitions. i.e 100 to 200 rows with same values.

    In this case can I include a different unique key on the main table segmentation ? Will it still be cosegmented ?
  • Hi!
    Should the segmented columns be unique ?
    No(in case it sorted by JOIN column/s).
    Hint - RLE.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file