We're Moving!

The Vertica Forum is moving to a new OpenText Analytics Database (Vertica) Community.

Join us there to post discussion topics, learn about

product releases, share tips, access the blog, and much more.

Create My New Community Account Now


Query event 'Consider a cosegmented, cosorted join' ? — Vertica Forum

Query event 'Consider a cosegmented, cosorted join' ?

For one of may query, query_event table has below suggestion.  

"Ensure that the plan is reasonable.  Consider a cosegmented, cosorted join.  Increase memory available to the plan."

cosorted join:
I assume it means both the projections should have join keys as sort keys.  
But does the order of sort keys matter ?
Also can one of the projections have more sort key then the other, assuming I want merge join ?

cosegmented:

Does this mean  they have to be segmented on the same columns ?

Strategy I am following for segmentation is to segment them by more a common key(which exists in all tables) and then by tables specific key. Is this correct approach ?


Thanks
-Siva


Comments

  • Dear Siva  -

    When you join two big tables , say, sales_fact and customer_dim - and especially if the join between those tables occurs in an important part of all queries -

    then,

    * you want to sort them by the join column (which makes them co-sorted). Here, you can, if you like, add other columns to the sort order, after the join column.

    * you want to segment them by the join column - and only by the join column, like in:
    segmented by HASH(customer_key).
    This assures that all rows of sales_fact with a customer_key of 42 will reside on the same node as the row of customer_dim with customer_key of 42. And this ensures that, to make the join between the two tables, no re-segmentation is necessary in the query plan, this means: no rows need to travel from one node to another -> this is a co-located join.
    But this only works if you segment the two tables by the HASH() of the same column - or column combination: HASH(42) and HASH(42, date '2015-02-15') will produce different results, and the concerned row will go to different nodes.

    You will see that you've done a good job when, in the output of your EXPLAIN SELECT .. , in the join of the two biggest tables, you will see: MERGE_JOIN (the fastest possible join, thanks to co-sorted join tables), and you won't see RESEGMENT (showing that you have a co-located join).

    Obviously, remember to keep small tables UNSEGMENTED ALL NODES, and only to segment big tables.

    Happy playing , and good luck -

    marco

  • Hi!

    Very nice answer of Marco Gessner.

    I just will add my 2 cents about EXPLAIN <query>: in the output of EXPLAIN VERBOSE <query> you don't wanna see a "network cost", only in a last path(a reduce job).
  • Thanks for the detailed answer. 

    Should the segmented columns be unique ? I have a scenario where relation between 2 tables is Many to One.  Many side is central table.

    If I include only join columns in the segment then many side will have frequent repetitions. i.e 100 to 200 rows with same values.

    In this case can I include a different unique key on the main table segmentation ? Will it still be cosegmented ?
  • Hi!
    Should the segmented columns be unique ?
    No(in case it sorted by JOIN column/s).
    Hint - RLE.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file
You can use Markdown in your post.