Vertica joins on large data sets is extremely slow

abtsantosbentenabtsantosbenten Registered User
edited June 11 in Vertica Forum

version: 9.0
8 nodes
170GB of memory available for transactions per node

I have this simple join query that runs for 11 minutes.
After implementing the suggested projections I got from the database designer, it was reduced to 7 mins. Not a very huge improvement.

Is vertica slow with joins on large data sets? Makes one question if this is really a big data database.

select count(*)
from fact_table01 m
JOIN fact_table02 s
ON m.id = s.id
JOIN fact_table03 i
ON m.id = i.id
and m.id = i.id
;

Fact tables in the query have at least 5B records.

The funny thing is, I checked the QUERY_EVENTS and I saw that there were suggestions like "Consider using a sorted projection." Which is pretty funny considering the projection used was from the database designer.

Comments

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert
    edited June 12

    Vertica works best if all of joined tables contain zero rows. Ha ha. Just kidding :smile:

    Are all of the projections involved segmented on the ID column?

    See:
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/AvoidingResegmentationDuringJoins.htm

    And are they sorted by the ID column?

    Also, for fun, try:

    select count(*) 
    from fact_table01 m 
    JOIN fact_table02 s 
    ON m.id = s.id 
    JOIN fact_table03 i 
    ON m.id = i.id 
    ;
    

    And...

    select count(1) 
    from fact_table01 m 
    JOIN fact_table02 s 
    ON m.id = s.id 
    JOIN fact_table03 i 
    ON m.id = i.id 
    ;
    

    And...

    select count(m.id) 
    from fact_table01 m 
    JOIN fact_table02 s 
    ON m.id = s.id 
    JOIN fact_table03 i 
    ON m.id = i.id 
    ;
    
  • abtsantosbentenabtsantosbenten Registered User

    Thanks for the response. I noticed that I made a mistake in the query I posted:

    it should have been this where the last condition is the one that was updated:

    select count(*)
    from fact_table01 m
    JOIN fact_table02 s
    ON m.s_id = s.id
    JOIN fact_table03 i
    ON m.s_id = i.s_id
    and m.i_id = i.id
    ;

    the projections of the tables:

    fact_table01
    ordered by s_id, i_id
    segmented on s_id, i_id

    fact_table02
    ordered by s_id
    segmented by s_id

    fact_table03
    ordered by s_id, i_id
    segmented by s_id, i_id

  • Jim_KnicelyJim_Knicely Administrator, Moderator, Employee, Registered User, VerticaExpert

    @abtsantosbenten - Also, maybe consider using a Vertica flattened table.

    See:
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/FlattenedTables/FlattenedTables.htm

    Note:

    Vertica 9.2 includes faster refresh of tables with multiple Live Aggregate Projections (LAP).

    FTs help deliver fast query performance on larger data sets. BTW, Flattened Tables are a Denormalized Table of (a part of) a Normalized Schema (Star Schema) or it is a Table with normalized and denormalized columns, where query performance is of essence.

    Before 9.2, if a column is referenced by any LAP, that column cannot be applied Flattened Table refresh_columns operation. In 9.2 we support refresh_columns (rebuild mode) on Flattened table with LAP defined on it.. Only aggregate LAP is supported. Not TopK or UDF.

  • abtsantosbentenabtsantosbenten Registered User

    Again, thanks for accommodating this issue. I'm grateful because I haven't had much attention on the issues I've posted. Your response is much appreciated.

    I like your suggestion, but I came across this discussion that said flattened tables were made for small dimension tables. In my case, the tables being joined are all fact tables with at least 5B records.

    link to the article mentioned:
    https://forum.vertica.com/discussion/240529/slow-performance-on-flattened-facts-table

  • skeswaniskeswani Employee, Registered User, VerticaExpert

    I am just curious, can you post the output of

    explain select count(*)
    from fact_table01 m
    JOIN fact_table02 s
    ON m.s_id = s.id
    JOIN fact_table03 i
    ON m.s_id = i.s_id
    and m.i_id = i.id;

  • skeswaniskeswani Employee, Registered User, VerticaExpert
    edited June 13

    based on limited info, looking at your projection design, you want table01 and table03 to join first. Then the result of that join with table02.

    the first join maybe a MergeJoin (but not necessarily) and the second one will be a hash join. we might be able to force a specific join order using hints.

  • abtsantosbentenabtsantosbenten Registered User

    QUERY PLAN DESCRIPTION:

    Access Path:
    +-GROUPBY NOTHING [Cost: 4M, Rows: 1] (PATH ID: 1)
    | Aggregates: count()
    | Execute on: All Nodes
    | +---> JOIN MERGEJOIN(inputs presorted) [Cost: 4M, Rows: 4] (PATH ID: 2) Inner (RESEGMENT)
    | | Join Cond: (m.s_id = i.s_id) AND (m.i_id = i.id)
    | | Execute on: All Nodes
    | | +-- Outer -> STORAGE ACCESS for i [Cost: 584K, Rows: 996M] (PATH ID: 3)
    | | | Projection: myschema.fact_table03
    | | | Materialize: i.s_id, i.i_id
    | | | Execute on: All Nodes
    | | | Runtime Filters: (SIP1(MergeJoin): i.s_id), (SIP2(MergeJoin): i.id), (SIP3(MergeJoin): i.s_id, i.id)
    | | +-- Inner -> JOIN MERGEJOIN(inputs presorted) [Cost: 2M, Rows: 973M] (PUSHED GROUPING) (PATH ID: 4) Outer (RESEGMENT)
    | | | Join Cond: (m.s_id = s.id)
    | | | Execute on: All Nodes
    | | | +-- Outer -> STORAGE ACCESS for m [Cost: 701K, Rows: 973M] (PATH ID: 5)
    | | | | Projection: myschema.fact_table01
    | | | | Materialize: m.s_id, m.i_id
    | | | | Execute on: All Nodes
    | | | +-- Inner -> STORAGE ACCESS for s [Cost: 156K, Rows: 533M] (PATH ID: 6)
    | | | | Projection: myschema.fact_table02
    | | | | Materialize: s.id
    | | | | Execute on: All Nodes

  • abtsantosbentenabtsantosbenten Registered User

    This plan is already using a projection from DBD. The plan originally showed a hash join.

  • skeswaniskeswani Employee, Registered User, VerticaExpert
    edited June 13

    can you try...
    I am curious about what timing you get on this one

    vsql> select add_vertica_options('OPT', 'PlanNoMJ');
    vsql> \timing
    vsql> select count(*) from fact_table01 m JOIN fact_table02 s ON m.s_id = s.id JOIN fact_table03 i ON m.s_id = i.s_id and m.i_id = i.id;
    select clr_vertica_options('OPT', 'PlanNoMJ');

  • abtsantosbentenabtsantosbenten Registered User

    I can only test the performance in the PROD environment because it's the only place with the large data set.

    I need to know first what this does: add_vertica_options('OPT', 'PlanNoMJ');

    How come there are no search results for add_vertica_options in the documentation?

  • skeswaniskeswani Employee, Registered User, VerticaExpert
    edited June 14

    This options causes the optimizer to avoid a merge join, in lieu of a hash join. Its not a documented option. It will allow for more threads to work on the join. I just want to see the performance of this option (sometimes it helps). Fact on Fact joins can sometime benefit from this.
    When you run this query with this option, also get the explain plan.

  • abtsantosbentenabtsantosbenten Registered User

    it ran for almost 4 mins (3 mins 48 secs)


    QUERY PLAN DESCRIPTION:

    Opt Vertica Options

    PLAN_NO_MJ

    explain select /+add_vertica_options('OPT', 'PlanNoMJ')/ count(1) from (...removed...)

    Access Path:
    +-GROUPBY NOTHING [Cost: 8M, Rows: 1] (PATH ID: 1)
    | Aggregates: count()
    | Execute on: All Nodes
    | +---> JOIN HASH [Cost: 8M, Rows: 4] (PATH ID: 2) Inner (RESEGMENT)
    | | Join Cond: (m.s_id = i.s_id) AND (m.i_id = i.id)
    | | Execute on: All Nodes
    | | +-- Outer -> STORAGE ACCESS for i [Cost: 584K, Rows: 996M] (PATH ID: 3)
    | | | Projection: myschema.fact_table03
    | | | Materialize: i.s_id, i.id
    | | | Execute on: All Nodes
    | | | Runtime Filters: (SIP1(HashJoin): i.s_id), (SIP2(HashJoin): i.id), (SIP3(HashJoin): i.s_id, i.id)
    | | +-- Inner -> JOIN HASH [Cost: 3M, Rows: 973M] (PUSHED GROUPING) (PATH ID: 4) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
    | | | Join Cond: (m.i_id = s.i_id)
    | | | Execute on: All Nodes
    | | | +-- Outer -> STORAGE ACCESS for m [Cost: 573K, Rows: 973M] (PATH ID: 5)
    | | | | Projection: myschema.fact_table01
    | | | | Materialize: m.s_id, m.i_id
    | | | | Execute on: All Nodes
    | | | +-- Inner -> STORAGE ACCESS for s [Cost: 156K, Rows: 533M] (PATH ID: 6)
    | | | | Projection: myschema.fact_table02
    | | | | Materialize: s.id
    | | | | Execute on: All Nodes

  • skeswaniskeswani Employee, Registered User, VerticaExpert

    Plan looks as expected with this option.
    When you get a chance can you tell me the performance of the query with this option.

  • abtsantosbentenabtsantosbenten Registered User

    it ran for almost 4 mins (3 mins 48 secs)

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file