Vertica joins on large data sets is extremely slow

abtsantosbenten · June 2019

version: 9.0
8 nodes
170GB of memory available for transactions per node

I have this simple join query that runs for 11 minutes.
After implementing the suggested projections I got from the database designer, it was reduced to 7 mins. Not a very huge improvement.

Is vertica slow with joins on large data sets? Makes one question if this is really a big data database.

select count(*)
from fact_table01 m
JOIN fact_table02 s
ON m.id = s.id
JOIN fact_table03 i
ON m.id = i.id
and m.id = i.id
;

Fact tables in the query have at least 5B records.

The funny thing is, I checked the QUERY_EVENTS and I saw that there were suggestions like "Consider using a sorted projection." Which is pretty funny considering the projection used was from the database designer.

Jim_Knicely · June 2019

Vertica works best if all of joined tables contain zero rows. Ha ha. Just kidding

Are all of the projections involved segmented on the ID column?

See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/AvoidingResegmentationDuringJoins.htm

And are they sorted by the ID column?

Also, for fun, try:

select count(*) 
from fact_table01 m 
JOIN fact_table02 s 
ON m.id = s.id 
JOIN fact_table03 i 
ON m.id = i.id 
;

And...

select count(1) 
from fact_table01 m 
JOIN fact_table02 s 
ON m.id = s.id 
JOIN fact_table03 i 
ON m.id = i.id 
;

And...

select count(m.id) 
from fact_table01 m 
JOIN fact_table02 s 
ON m.id = s.id 
JOIN fact_table03 i 
ON m.id = i.id 
;

abtsantosbenten · June 2019

Thanks for the response. I noticed that I made a mistake in the query I posted:

it should have been this where the last condition is the one that was updated:

select count(*)
from fact_table01 m
JOIN fact_table02 s
ON m.s_id = s.id
JOIN fact_table03 i
ON m.s_id = i.s_id
and m.i_id = i.id
;

the projections of the tables:

fact_table01
ordered by s_id, i_id
segmented on s_id, i_id

fact_table02
ordered by s_id
segmented by s_id

fact_table03
ordered by s_id, i_id
segmented by s_id, i_id

Jim_Knicely · June 2019

@abtsantosbenten - Also, maybe consider using a Vertica flattened table.

See:
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/FlattenedTables/FlattenedTables.htm

Note:

Vertica 9.2 includes faster refresh of tables with multiple Live Aggregate Projections (LAP).

FTs help deliver fast query performance on larger data sets. BTW, Flattened Tables are a Denormalized Table of (a part of) a Normalized Schema (Star Schema) or it is a Table with normalized and denormalized columns, where query performance is of essence.

Before 9.2, if a column is referenced by any LAP, that column cannot be applied Flattened Table refresh_columns operation. In 9.2 we support refresh_columns (rebuild mode) on Flattened table with LAP defined on it.. Only aggregate LAP is supported. Not TopK or UDF.

abtsantosbenten · June 2019

Again, thanks for accommodating this issue. I'm grateful because I haven't had much attention on the issues I've posted. Your response is much appreciated.

I like your suggestion, but I came across this discussion that said flattened tables were made for small dimension tables. In my case, the tables being joined are all fact tables with at least 5B records.

link to the article mentioned:
https://forum.vertica.com/discussion/240529/slow-performance-on-flattened-facts-table

skeswani · June 2019

I am just curious, can you post the output of

explain select count(*)
from fact_table01 m
JOIN fact_table02 s
ON m.s_id = s.id
JOIN fact_table03 i
ON m.s_id = i.s_id
and m.i_id = i.id;

skeswani · June 2019

based on limited info, looking at your projection design, you want table01 and table03 to join first. Then the result of that join with table02.

the first join maybe a MergeJoin (but not necessarily) and the second one will be a hash join. we might be able to force a specific join order using hints.

abtsantosbenten · June 2019

QUERY PLAN DESCRIPTION:

Access Path:
+-GROUPBY NOTHING [Cost: 4M, Rows: 1] (PATH ID: 1)
| Aggregates: count()
| Execute on: All Nodes
| +---> JOIN MERGEJOIN(inputs presorted) [Cost: 4M, Rows: 4] (PATH ID: 2) Inner (RESEGMENT)
| | Join Cond: (m.s_id = i.s_id) AND (m.i_id = i.id)
| | Execute on: All Nodes
| | +-- Outer -> STORAGE ACCESS for i [Cost: 584K, Rows: 996M] (PATH ID: 3)
| | | Projection: myschema.fact_table03
| | | Materialize: i.s_id, i.i_id
| | | Execute on: All Nodes
| | | Runtime Filters: (SIP1(MergeJoin): i.s_id), (SIP2(MergeJoin): i.id), (SIP3(MergeJoin): i.s_id, i.id)
| | +-- Inner -> JOIN MERGEJOIN(inputs presorted) [Cost: 2M, Rows: 973M] (PUSHED GROUPING) (PATH ID: 4) Outer (RESEGMENT)
| | | Join Cond: (m.s_id = s.id)
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for m [Cost: 701K, Rows: 973M] (PATH ID: 5)
| | | | Projection: myschema.fact_table01
| | | | Materialize: m.s_id, m.i_id
| | | | Execute on: All Nodes
| | | +-- Inner -> STORAGE ACCESS for s [Cost: 156K, Rows: 533M] (PATH ID: 6)
| | | | Projection: myschema.fact_table02
| | | | Materialize: s.id
| | | | Execute on: All Nodes

abtsantosbenten · June 2019

This plan is already using a projection from DBD. The plan originally showed a hash join.

skeswani · June 2019

can you try...
I am curious about what timing you get on this one

vsql> select add_vertica_options('OPT', 'PlanNoMJ');
vsql> \timing
vsql> select count(*) from fact_table01 m JOIN fact_table02 s ON m.s_id = s.id JOIN fact_table03 i ON m.s_id = i.s_id and m.i_id = i.id;
select clr_vertica_options('OPT', 'PlanNoMJ');

abtsantosbenten · June 2019

I can only test the performance in the PROD environment because it's the only place with the large data set.

I need to know first what this does: add_vertica_options('OPT', 'PlanNoMJ');

How come there are no search results for add_vertica_options in the documentation?

skeswani · June 2019

This options causes the optimizer to avoid a merge join, in lieu of a hash join. Its not a documented option. It will allow for more threads to work on the join. I just want to see the performance of this option (sometimes it helps). Fact on Fact joins can sometime benefit from this.
When you run this query with this option, also get the explain plan.

abtsantosbenten · June 2019

it ran for almost 4 mins (3 mins 48 secs)

QUERY PLAN DESCRIPTION:

Opt Vertica Options

PLAN_NO_MJ

explain select /+add_vertica_options('OPT', 'PlanNoMJ')/ count(1) from (...removed...)

Access Path:
+-GROUPBY NOTHING [Cost: 8M, Rows: 1] (PATH ID: 1)
| Aggregates: count()
| Execute on: All Nodes
| +---> JOIN HASH [Cost: 8M, Rows: 4] (PATH ID: 2) Inner (RESEGMENT)
| | Join Cond: (m.s_id = i.s_id) AND (m.i_id = i.id)
| | Execute on: All Nodes
| | +-- Outer -> STORAGE ACCESS for i [Cost: 584K, Rows: 996M] (PATH ID: 3)
| | | Projection: myschema.fact_table03
| | | Materialize: i.s_id, i.id
| | | Execute on: All Nodes
| | | Runtime Filters: (SIP1(HashJoin): i.s_id), (SIP2(HashJoin): i.id), (SIP3(HashJoin): i.s_id, i.id)
| | +-- Inner -> JOIN HASH [Cost: 3M, Rows: 973M] (PUSHED GROUPING) (PATH ID: 4) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
| | | Join Cond: (m.i_id = s.i_id)
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for m [Cost: 573K, Rows: 973M] (PATH ID: 5)
| | | | Projection: myschema.fact_table01
| | | | Materialize: m.s_id, m.i_id
| | | | Execute on: All Nodes
| | | +-- Inner -> STORAGE ACCESS for s [Cost: 156K, Rows: 533M] (PATH ID: 6)
| | | | Projection: myschema.fact_table02
| | | | Materialize: s.id
| | | | Execute on: All Nodes

skeswani · June 2019

Plan looks as expected with this option.
When you get a chance can you tell me the performance of the query with this option.

abtsantosbenten · June 2019

it ran for almost 4 mins (3 mins 48 secs)

skeswani · June 2019

11 minutes to 7 mins to 4 min. some what better - right.
Its a FACT on FACT on FACT join, so its going to be expensive. but 4 min on 5B x 5B x 5B records may ok. you can get faster by adding nodes or more cpus.

you can use the syntatic opt and force a hash join, see info here. that is essentially what was done with the PlanNoMJ option

see
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Hints/JType.htm

Note that Merge Join is very efficient with resources. the DBD it trying to optimize the query using the least possible resources. The plan that yields 4 min is not the most efficient, but is faster.

Jim_Knicely · June 2019

Fyi... We have clients that have created wide flat tables (not Vertica Flattened Tables) to avoid joins on massive tables at runtime. So, your 3 fact tables could be a single table.

abtsantosbenten · June 2019

actually in a "good day", low workload in the database, I saw this query run as 1 min 33 seconds.

but still doesn't beat the previous database that was able to run this at 13 seconds.

I'll explore the flattened/de-normalized.

abtsantosbenten · June 2019

So far, what I've gathered here is that scaled fact on fact table joins is a limitation of Vertica. And the option of this type of scenario are:

flatten the tables
add more hardware resources

is this a correct conclusion?

Vertica joins on large data sets is extremely slow

Comments

QUERY PLAN DESCRIPTION:

QUERY PLAN DESCRIPTION:

Opt Vertica Options

Leave a Comment