Vertica joins on large data sets are extremely slow
version: 9.0
8 nodes
170GB of memory available for transactions per node
I have this simple join query that runs for 11 minutes.
After implementing the projections suggested by the Database Designer, it was reduced to 7 minutes. Not a huge improvement.
Is Vertica slow with joins on large data sets? It makes me question whether this is really a big-data database.
select count(*)
from fact_table01 m
JOIN fact_table02 s
ON m.id = s.id
JOIN fact_table03 i
ON m.id = i.id
and m.id = i.id
;
Fact tables in the query have at least 5B records.
The funny thing is, I checked QUERY_EVENTS and saw suggestions like "Consider using a sorted projection," which is pretty funny considering the projection used came from the Database Designer.
Comments
Vertica works best if all of joined tables contain zero rows. Ha ha. Just kidding
Are all of the projections involved segmented on the ID column?
See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/AvoidingResegmentationDuringJoins.htm
And are they sorted by the ID column?
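One way to check both questions at once is to query the catalog views. This is a sketch against v_catalog (view and column names as in recent Vertica versions; the table names are taken from the question):

```sql
-- Inspect sort order and segmentation of the projections on the three fact tables
SELECT p.projection_name,
       p.is_segmented,
       p.segment_expression,       -- shows which columns drive segmentation
       pc.table_column_name,
       pc.sort_position            -- NULL if the column is not part of the sort order
FROM   v_catalog.projections p
JOIN   v_catalog.projection_columns pc
       ON pc.projection_id = p.projection_id
WHERE  p.anchor_table_name IN ('fact_table01', 'fact_table02', 'fact_table03')
ORDER  BY p.projection_name, pc.sort_position;
```

If the join-key columns appear in segment_expression and lead the sort order on all projections involved, the join can run locally without resegmentation.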
Thanks for the response. I noticed I made a mistake in the query I posted;
it should have been this, where the last condition is the one that was updated:
select count(*)
from fact_table01 m
JOIN fact_table02 s
ON m.s_id = s.id
JOIN fact_table03 i
ON m.s_id = i.s_id
and m.i_id = i.id
;
the projections of the tables:
fact_table01
ordered by s_id, i_id
segmented by s_id, i_id
fact_table02
ordered by s_id
segmented by s_id
fact_table03
ordered by s_id, i_id
segmented by s_id, i_id
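Projections along those lines could be declared roughly like this (a sketch, not the DBD's actual DDL; the schema name and column names are assumed from the corrected query). Identically sorted, identically segmented projections are what allow the fact_table01-to-fact_table03 join to run as a local merge join:

```sql
-- Sketch: matching sort order and segmentation on the merge-join keys
CREATE PROJECTION myschema.fact_table01_p AS
SELECT * FROM myschema.fact_table01
ORDER BY s_id, i_id
SEGMENTED BY HASH(s_id, i_id) ALL NODES;

CREATE PROJECTION myschema.fact_table03_p AS
SELECT * FROM myschema.fact_table03
ORDER BY s_id, id              -- id matches m.i_id per the corrected query
SEGMENTED BY HASH(s_id, id) ALL NODES;

-- Populate the new projections
SELECT START_REFRESH();
```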
@abtsantosbenten - Also, maybe consider using a Vertica flattened table.
See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/FlattenedTables/FlattenedTables.htm
Note:
Vertica 9.2 includes faster refresh of tables with multiple Live Aggregate Projections (LAP).
FTs help deliver fast query performance on larger data sets. BTW, a flattened table is a denormalized table built from (part of) a normalized schema (e.g., a star schema), or a table that mixes normalized and denormalized columns, used where query performance is essential.
Before 9.2, if a column was referenced by any LAP, the flattened-table refresh_columns operation could not be applied to that column. In 9.2 we support refresh_columns (rebuild mode) on a flattened table with a LAP defined on it. Only aggregate LAPs are supported, not Top-K or UDF projections.
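As a minimal sketch of the mechanism (the table and column names here are hypothetical, not from the poster's schema): a flattened table carries a derived column defined with SET USING, which is populated on demand with REFRESH_COLUMNS.

```sql
-- Hypothetical flattened table: the fact row carries a column
-- denormalized from another table via SET USING
CREATE TABLE flat_fact01 (
    id     INT,
    s_id   INT,
    i_id   INT,
    s_attr INT SET USING (SELECT s.attr
                          FROM   fact_table02 s
                          WHERE  s.id = flat_fact01.s_id)
);

-- Populate/refresh the derived column; REBUILD mode with a LAP
-- on the column requires 9.2+
SELECT REFRESH_COLUMNS('flat_fact01', 's_attr', 'REBUILD');
```

Queries then read s_attr directly instead of joining to fact_table02 at runtime.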
Again, thanks for looking into this issue. I'm grateful because I haven't had much attention on the issues I've posted; your response is much appreciated.
I like your suggestion, but I came across this discussion that said flattened tables were made for small dimension tables. In my case, the tables being joined are all fact tables with at least 5B records.
link to the article mentioned:
https://forum.vertica.com/discussion/240529/slow-performance-on-flattened-facts-table
I am just curious, can you post the output of
Based on the limited info and your projection design, you want table01 and table03 to join first, then the result of that join with table02.
The first join may be a merge join (but not necessarily) and the second one will be a hash join. We might be able to force a specific join order using hints.
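One way to pin the join order is the SYNTACTIC_JOIN hint, which tells the optimizer to join the tables in the order they are written. A sketch using the corrected query (this rewrites the FROM clause so table01 meets table03 first; not necessarily what the commenter had in mind):

```sql
-- Sketch: enforce the written join order: (m JOIN i) first, then s
SELECT /*+SYNTACTIC_JOIN*/ COUNT(*)
FROM fact_table01 m
JOIN fact_table03 i
  ON m.s_id = i.s_id
 AND m.i_id = i.id
JOIN fact_table02 s
  ON m.s_id = s.id;
```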
QUERY PLAN DESCRIPTION:
Access Path:
+-GROUPBY NOTHING [Cost: 4M, Rows: 1] (PATH ID: 1)
| Aggregates: count()
| Execute on: All Nodes
| +---> JOIN MERGEJOIN(inputs presorted) [Cost: 4M, Rows: 4] (PATH ID: 2) Inner (RESEGMENT)
| | Join Cond: (m.s_id = i.s_id) AND (m.i_id = i.id)
| | Execute on: All Nodes
| | +-- Outer -> STORAGE ACCESS for i [Cost: 584K, Rows: 996M] (PATH ID: 3)
| | | Projection: myschema.fact_table03
| | | Materialize: i.s_id, i.i_id
| | | Execute on: All Nodes
| | | Runtime Filters: (SIP1(MergeJoin): i.s_id), (SIP2(MergeJoin): i.id), (SIP3(MergeJoin): i.s_id, i.id)
| | +-- Inner -> JOIN MERGEJOIN(inputs presorted) [Cost: 2M, Rows: 973M] (PUSHED GROUPING) (PATH ID: 4) Outer (RESEGMENT)
| | | Join Cond: (m.s_id = s.id)
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for m [Cost: 701K, Rows: 973M] (PATH ID: 5)
| | | | Projection: myschema.fact_table01
| | | | Materialize: m.s_id, m.i_id
| | | | Execute on: All Nodes
| | | +-- Inner -> STORAGE ACCESS for s [Cost: 156K, Rows: 533M] (PATH ID: 6)
| | | | Projection: myschema.fact_table02
| | | | Materialize: s.id
| | | | Execute on: All Nodes
This plan is already using a projection from DBD. The plan originally showed a hash join.
Can you try...
I am curious about what timing you get on this one.
I can only test the performance in the PROD environment because it's the only place with the large data set.
I need to know first what this does: add_vertica_options('OPT', 'PlanNoMJ');
How come there are no search results for add_vertica_options in the documentation?
This option causes the optimizer to avoid a merge join in favor of a hash join. It's not a documented option. It allows more threads to work on the join. I just want to see the performance with this option (sometimes it helps); fact-on-fact joins can sometimes benefit from it.
When you run this query with this option, also get the explain plan.
QUERY PLAN DESCRIPTION:
Opt Vertica Options
PLAN_NO_MJ
explain select /*+add_vertica_options('OPT', 'PlanNoMJ')*/ count(1) from (...removed...)
Access Path:
+-GROUPBY NOTHING [Cost: 8M, Rows: 1] (PATH ID: 1)
| Aggregates: count()
| Execute on: All Nodes
| +---> JOIN HASH [Cost: 8M, Rows: 4] (PATH ID: 2) Inner (RESEGMENT)
| | Join Cond: (m.s_id = i.s_id) AND (m.i_id = i.id)
| | Execute on: All Nodes
| | +-- Outer -> STORAGE ACCESS for i [Cost: 584K, Rows: 996M] (PATH ID: 3)
| | | Projection: myschema.fact_table03
| | | Materialize: i.s_id, i.id
| | | Execute on: All Nodes
| | | Runtime Filters: (SIP1(HashJoin): i.s_id), (SIP2(HashJoin): i.id), (SIP3(HashJoin): i.s_id, i.id)
| | +-- Inner -> JOIN HASH [Cost: 3M, Rows: 973M] (PUSHED GROUPING) (PATH ID: 4) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
| | | Join Cond: (m.i_id = s.i_id)
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for m [Cost: 573K, Rows: 973M] (PATH ID: 5)
| | | | Projection: myschema.fact_table01
| | | | Materialize: m.s_id, m.i_id
| | | | Execute on: All Nodes
| | | +-- Inner -> STORAGE ACCESS for s [Cost: 156K, Rows: 533M] (PATH ID: 6)
| | | | Projection: myschema.fact_table02
| | | | Materialize: s.id
| | | | Execute on: All Nodes
Plan looks as expected with this option.
When you get a chance, can you tell me the performance of the query with this option?
it ran for almost 4 mins (3 mins 48 secs)
11 minutes to 7 minutes to 4 minutes. Somewhat better, right?
It's a FACT-on-FACT-on-FACT join, so it's going to be expensive, but 4 minutes on 5B x 5B x 5B records may be OK. You can get faster by adding nodes or more CPUs.
You can use the syntactic join hint to force a hash join; see the info here. That is essentially what was done with the PlanNoMJ option.
see
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Hints/JType.htm
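Per that docs page, the JTYPE hint goes right after each JOIN keyword. A sketch of the corrected query requesting hash joins explicitly, instead of the undocumented PlanNoMJ option:

```sql
-- Sketch: JTYPE(H) asks for a hash join at each join site
SELECT COUNT(*)
FROM fact_table01 m
JOIN /*+JTYPE(H)*/ fact_table02 s
  ON m.s_id = s.id
JOIN /*+JTYPE(H)*/ fact_table03 i
  ON m.s_id = i.s_id
 AND m.i_id = i.id;
```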
Note that a merge join is very efficient with resources; the DBD is trying to optimize the query using the least possible resources. The plan that yields 4 minutes is not the most resource-efficient, but it is faster.
FYI, we have clients that have created wide flat tables (not Vertica flattened tables) to avoid joins on massive tables at runtime. So your 3 fact tables could become a single table.
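The wide-flat-table approach amounts to pre-joining offline with CREATE TABLE AS. A sketch (the table name and projected column list are illustrative; in practice you would enumerate the columns your queries need):

```sql
-- Sketch: materialize the three-way join once, so runtime queries
-- scan a single table instead of joining 5B-row facts
CREATE TABLE wide_fact AS
SELECT m.*,                      -- plus whichever s/i columns queries need
       s.id AS s_table_id,
       i.id AS i_table_id
FROM fact_table01 m
JOIN fact_table02 s ON m.s_id = s.id
JOIN fact_table03 i ON m.s_id = i.s_id
                   AND m.i_id = i.id;
```

The trade-off is storage and load-time cost in exchange for eliminating the join at query time.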
Actually, on a "good day" (low workload in the database), I saw this query run in 1 min 33 secs.
But it still doesn't beat the previous database, which was able to run this in 13 seconds.
I'll explore the flattened/denormalized approach.
So far, what I've gathered here is that scaled fact-on-fact joins are a limitation of Vertica, and the options for this type of scenario are to force a hash join with hints, scale out with more nodes or CPUs, or denormalize into a single wide table.
Is this a correct conclusion?