Vertica joins on large data sets is extremely slow

edited June 2019 in General Discussion

version: 9.0
8 nodes
170GB of memory available for transactions per node

I have this simple join query that runs for 11 minutes.
After implementing the suggested projections I got from the database designer, it was reduced to 7 mins. Not a very huge improvement.

Is vertica slow with joins on large data sets? Makes one question if this is really a big data database.

select count(*)
from fact_table01 m
JOIN fact_table02 s
ON m.id = s.id
JOIN fact_table03 i
ON m.id = i.id
and m.id = i.id
;

Fact tables in the query have at least 5B records.

The funny thing is, I checked the QUERY_EVENTS and I saw that there were suggestions like "Consider using a sorted projection." Which is pretty funny considering the projection used was from the database designer.

Comments

  • Jim_KnicelyJim_Knicely - Select Field - Administrator
    edited June 2019

    Vertica works best if all of joined tables contain zero rows. Ha ha. Just kidding :smile:

    Are all of the projections involved segmented on the ID column?

    See:
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/AvoidingResegmentationDuringJoins.htm

    And are they sorted by the ID column?

    Also, for fun, try:

    select count(*) 
    from fact_table01 m 
    JOIN fact_table02 s 
    ON m.id = s.id 
    JOIN fact_table03 i 
    ON m.id = i.id 
    ;
    

    And...

    select count(1) 
    from fact_table01 m 
    JOIN fact_table02 s 
    ON m.id = s.id 
    JOIN fact_table03 i 
    ON m.id = i.id 
    ;
    

    And...

    select count(m.id) 
    from fact_table01 m 
    JOIN fact_table02 s 
    ON m.id = s.id 
    JOIN fact_table03 i 
    ON m.id = i.id 
    ;
    
  • Thanks for the response. I noticed that I made a mistake in the query I posted:

    it should have been this where the last condition is the one that was updated:

    select count(*)
    from fact_table01 m
    JOIN fact_table02 s
    ON m.s_id = s.id
    JOIN fact_table03 i
    ON m.s_id = i.s_id
    and m.i_id = i.id
    ;

    the projections of the tables:

    fact_table01
    ordered by s_id, i_id
    segmented on s_id, i_id

    fact_table02
    ordered by s_id
    segmented by s_id

    fact_table03
    ordered by s_id, i_id
    segmented by s_id, i_id

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    @abtsantosbenten - Also, maybe consider using a Vertica flattened table.

    See:
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/FlattenedTables/FlattenedTables.htm

    Note:

    Vertica 9.2 includes faster refresh of tables with multiple Live Aggregate Projections (LAP).

    FTs help deliver fast query performance on larger data sets. BTW, Flattened Tables are a Denormalized Table of (a part of) a Normalized Schema (Star Schema) or it is a Table with normalized and denormalized columns, where query performance is of essence.

    Before 9.2, if a column is referenced by any LAP, that column cannot be applied Flattened Table refresh_columns operation. In 9.2 we support refresh_columns (rebuild mode) on Flattened table with LAP defined on it.. Only aggregate LAP is supported. Not TopK or UDF.

  • Again, thanks for accommodating this issue. I'm grateful because I haven't had much attention on the issues I've posted. Your response is much appreciated.

    I like your suggestion, but I came across this discussion that said flattened tables were made for small dimension tables. In my case, the tables being joined are all fact tables with at least 5B records.

    link to the article mentioned:
    https://forum.vertica.com/discussion/240529/slow-performance-on-flattened-facts-table

  • skeswaniskeswani - Select Field - Employee

    I am just curious, can you post the output of

    explain select count(*)
    from fact_table01 m
    JOIN fact_table02 s
    ON m.s_id = s.id
    JOIN fact_table03 i
    ON m.s_id = i.s_id
    and m.i_id = i.id;

  • skeswaniskeswani - Select Field - Employee
    edited June 2019

    based on limited info, looking at your projection design, you want table01 and table03 to join first. Then the result of that join with table02.

    the first join maybe a MergeJoin (but not necessarily) and the second one will be a hash join. we might be able to force a specific join order using hints.


  • QUERY PLAN DESCRIPTION:

    Access Path:
    +-GROUPBY NOTHING [Cost: 4M, Rows: 1] (PATH ID: 1)
    | Aggregates: count()
    | Execute on: All Nodes
    | +---> JOIN MERGEJOIN(inputs presorted) [Cost: 4M, Rows: 4] (PATH ID: 2) Inner (RESEGMENT)
    | | Join Cond: (m.s_id = i.s_id) AND (m.i_id = i.id)
    | | Execute on: All Nodes
    | | +-- Outer -> STORAGE ACCESS for i [Cost: 584K, Rows: 996M] (PATH ID: 3)
    | | | Projection: myschema.fact_table03
    | | | Materialize: i.s_id, i.i_id
    | | | Execute on: All Nodes
    | | | Runtime Filters: (SIP1(MergeJoin): i.s_id), (SIP2(MergeJoin): i.id), (SIP3(MergeJoin): i.s_id, i.id)
    | | +-- Inner -> JOIN MERGEJOIN(inputs presorted) [Cost: 2M, Rows: 973M] (PUSHED GROUPING) (PATH ID: 4) Outer (RESEGMENT)
    | | | Join Cond: (m.s_id = s.id)
    | | | Execute on: All Nodes
    | | | +-- Outer -> STORAGE ACCESS for m [Cost: 701K, Rows: 973M] (PATH ID: 5)
    | | | | Projection: myschema.fact_table01
    | | | | Materialize: m.s_id, m.i_id
    | | | | Execute on: All Nodes
    | | | +-- Inner -> STORAGE ACCESS for s [Cost: 156K, Rows: 533M] (PATH ID: 6)
    | | | | Projection: myschema.fact_table02
    | | | | Materialize: s.id
    | | | | Execute on: All Nodes

  • This plan is already using a projection from DBD. The plan originally showed a hash join.

  • skeswaniskeswani - Select Field - Employee
    edited June 2019

    can you try...
    I am curious about what timing you get on this one

    vsql> select add_vertica_options('OPT', 'PlanNoMJ');
    vsql> \timing
    vsql> select count(*) from fact_table01 m JOIN fact_table02 s ON m.s_id = s.id JOIN fact_table03 i ON m.s_id = i.s_id and m.i_id = i.id;
    select clr_vertica_options('OPT', 'PlanNoMJ');

  • I can only test the performance in the PROD environment because it's the only place with the large data set.

    I need to know first what this does: add_vertica_options('OPT', 'PlanNoMJ');

    How come there are no search results for add_vertica_options in the documentation?

  • skeswaniskeswani - Select Field - Employee
    edited June 2019

    This options causes the optimizer to avoid a merge join, in lieu of a hash join. Its not a documented option. It will allow for more threads to work on the join. I just want to see the performance of this option (sometimes it helps). Fact on Fact joins can sometime benefit from this.
    When you run this query with this option, also get the explain plan.

  • it ran for almost 4 mins (3 mins 48 secs)


    QUERY PLAN DESCRIPTION:

    Opt Vertica Options

    PLAN_NO_MJ

    explain select /+add_vertica_options('OPT', 'PlanNoMJ')/ count(1) from (...removed...)

    Access Path:
    +-GROUPBY NOTHING [Cost: 8M, Rows: 1] (PATH ID: 1)
    | Aggregates: count()
    | Execute on: All Nodes
    | +---> JOIN HASH [Cost: 8M, Rows: 4] (PATH ID: 2) Inner (RESEGMENT)
    | | Join Cond: (m.s_id = i.s_id) AND (m.i_id = i.id)
    | | Execute on: All Nodes
    | | +-- Outer -> STORAGE ACCESS for i [Cost: 584K, Rows: 996M] (PATH ID: 3)
    | | | Projection: myschema.fact_table03
    | | | Materialize: i.s_id, i.id
    | | | Execute on: All Nodes
    | | | Runtime Filters: (SIP1(HashJoin): i.s_id), (SIP2(HashJoin): i.id), (SIP3(HashJoin): i.s_id, i.id)
    | | +-- Inner -> JOIN HASH [Cost: 3M, Rows: 973M] (PUSHED GROUPING) (PATH ID: 4) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
    | | | Join Cond: (m.i_id = s.i_id)
    | | | Execute on: All Nodes
    | | | +-- Outer -> STORAGE ACCESS for m [Cost: 573K, Rows: 973M] (PATH ID: 5)
    | | | | Projection: myschema.fact_table01
    | | | | Materialize: m.s_id, m.i_id
    | | | | Execute on: All Nodes
    | | | +-- Inner -> STORAGE ACCESS for s [Cost: 156K, Rows: 533M] (PATH ID: 6)
    | | | | Projection: myschema.fact_table02
    | | | | Materialize: s.id
    | | | | Execute on: All Nodes

  • skeswaniskeswani - Select Field - Employee

    Plan looks as expected with this option.
    When you get a chance can you tell me the performance of the query with this option.

  • it ran for almost 4 mins (3 mins 48 secs)

  • skeswaniskeswani - Select Field - Employee

    11 minutes to 7 mins to 4 min. some what better - right.
    Its a FACT on FACT on FACT join, so its going to be expensive. but 4 min on 5B x 5B x 5B records may ok. you can get faster by adding nodes or more cpus.

    you can use the syntatic opt and force a hash join, see info here. that is essentially what was done with the PlanNoMJ option

    see
    https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Hints/JType.htm

    Note that Merge Join is very efficient with resources. the DBD it trying to optimize the query using the least possible resources. The plan that yields 4 min is not the most efficient, but is faster.

  • Jim_KnicelyJim_Knicely - Select Field - Administrator

    Fyi... We have clients that have created wide flat tables (not Vertica Flattened Tables) to avoid joins on massive tables at runtime. So, your 3 fact tables could be a single table.

  • actually in a "good day", low workload in the database, I saw this query run as 1 min 33 seconds.

    but still doesn't beat the previous database that was able to run this at 13 seconds.

    I'll explore the flattened/de-normalized.

  • edited June 2019

    So far, what I've gathered here is that scaled fact on fact table joins is a limitation of Vertica. And the option of this type of scenario are:

    • flatten the tables
    • add more hardware resources

    is this a correct conclusion?

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file