Query event 'Consider a cosegmented, cosorted join' ?
For one of may query, query_event table has below suggestion.
"Ensure that the plan is reasonable. Consider a cosegmented, cosorted join. Increase memory available to the plan."
cosorted join:
I assume it means both the projections should have join keys as sort keys.
But does the order of sort keys matter ?
Also can one of the projections have more sort key then the other, assuming I want merge join ?
cosegmented:
Does this mean they have to be segmented on the same columns ?
Strategy I am following for segmentation is to segment them by more a common key(which exists in all tables) and then by tables specific key. Is this correct approach ?
Thanks
-Siva
"Ensure that the plan is reasonable. Consider a cosegmented, cosorted join. Increase memory available to the plan."
cosorted join:
I assume it means both the projections should have join keys as sort keys.
But does the order of sort keys matter ?
Also can one of the projections have more sort key then the other, assuming I want merge join ?
cosegmented:
Does this mean they have to be segmented on the same columns ?
Strategy I am following for segmentation is to segment them by more a common key(which exists in all tables) and then by tables specific key. Is this correct approach ?
Thanks
-Siva
0
Comments
Dear Siva -
When you join two big tables , say, sales_fact and customer_dim - and especially if the join between those tables occurs in an important part of all queries -
then,
* you want to sort them by the join column (which makes them co-sorted). Here, you can, if you like, add other columns to the sort order, after the join column.
* you want to segment them by the join column - and only by the join column, like in:
segmented by HASH(customer_key).
This assures that all rows of sales_fact with a customer_key of 42 will reside on the same node as the row of customer_dim with customer_key of 42. And this ensures that, to make the join between the two tables, no re-segmentation is necessary in the query plan, this means: no rows need to travel from one node to another -> this is a co-located join.
But this only works if you segment the two tables by the HASH() of the same column - or column combination: HASH(42) and HASH(42, date '2015-02-15') will produce different results, and the concerned row will go to different nodes.
You will see that you've done a good job when, in the output of your EXPLAIN SELECT .. , in the join of the two biggest tables, you will see: MERGE_JOIN (the fastest possible join, thanks to co-sorted join tables), and you won't see RESEGMENT (showing that you have a co-located join).
Obviously, remember to keep small tables UNSEGMENTED ALL NODES, and only to segment big tables.
Happy playing , and good luck -
marco
Very nice answer of Marco Gessner.
I just will add my 2 cents about EXPLAIN <query>: in the output of EXPLAIN VERBOSE <query> you don't wanna see a "network cost", only in a last path(a reduce job).
Should the segmented columns be unique ? I have a scenario where relation between 2 tables is Many to One. Many side is central table.
If I include only join columns in the segment then many side will have frequent repetitions. i.e 100 to 200 rows with same values.
In this case can I include a different unique key on the main table segmentation ? Will it still be cosegmented ?
Hint - RLE.