Query event 'Consider a cosegmented, cosorted join' ?

siva · February 2015

For one of may query, query_event table has below suggestion.

"Ensure that the plan is reasonable. Consider a cosegmented, cosorted join. Increase memory available to the plan."

cosorted join:
I assume it means both the projections should have join keys as sort keys.
But does the order of sort keys matter ?
Also can one of the projections have more sort key then the other, assuming I want merge join ?

cosegmented:

Does this mean they have to be segmented on the same columns ?

Strategy I am following for segmentation is to segment them by more a common key(which exists in all tables) and then by tables specific key. Is this correct approach ?

Thanks
-Siva

Marco_Gessner · February 2015

Dear Siva -

When you join two big tables , say, sales_fact and customer_dim - and especially if the join between those tables occurs in an important part of all queries -

then,

* you want to sort them by the join column (which makes them co-sorted). Here, you can, if you like, add other columns to the sort order, after the join column.

* you want to segment them by the join column - and only by the join column, like in:
segmented by HASH(customer_key).
This assures that all rows of sales_fact with a customer_key of 42 will reside on the same node as the row of customer_dim with customer_key of 42. And this ensures that, to make the join between the two tables, no re-segmentation is necessary in the query plan, this means: no rows need to travel from one node to another -> this is a co-located join.
But this only works if you segment the two tables by the HASH() of the same column - or column combination: HASH(42) and HASH(42, date '2015-02-15') will produce different results, and the concerned row will go to different nodes.

You will see that you've done a good job when, in the output of your EXPLAIN SELECT .. , in the join of the two biggest tables, you will see: MERGE_JOIN (the fastest possible join, thanks to co-sorted join tables), and you won't see RESEGMENT (showing that you have a co-located join).

Obviously, remember to keep small tables UNSEGMENTED ALL NODES, and only to segment big tables.

Happy playing , and good luck -

marco

id10t · February 2015

Hi!

Very nice answer of Marco Gessner.

I just will add my 2 cents about EXPLAIN <query>: in the output of EXPLAIN VERBOSE <query> you don't wanna see a "network cost", only in a last path(a reduce job).

siva · February 2015

Thanks for the detailed answer.

Should the segmented columns be unique ? I have a scenario where relation between 2 tables is Many to One. Many side is central table.

If I include only join columns in the segment then many side will have frequent repetitions. i.e 100 to 200 rows with same values.

In this case can I include a different unique key on the main table segmentation ? Will it still be cosegmented ?

id10t · February 2015

Hi!

Should the segmented columns be unique ?

No(in case it sorted by JOIN column/s).
Hint - RLE.

We're Moving!

Create My New Community Account Now

Query event 'Consider a cosegmented, cosorted join' ?

Comments

Leave a Comment