slow multi-join query for small database (~120K total rows, 10 minutes)
Hi Folks. I'd love to get your help understanding why this join-heavy query takes about 10 minutes to run on a small database of seven tables totaling < 120K rows, and ideally get your suggestions on how to make it faster on our little cluster of four nodes. I've put supporting information at https://gist.github.com/anonymous/8862796 (list of tables, list of fields by table, and table sizes), but following is the query and the EXPLAIN VERBOSE output. I ran ANALYZE_WORKLOAD() on this query and then followed its suggestion to run ANALYZE_STATISTICS on all tables. This resulted in no improvement. I then did its second suggestion of running the Database Designer, which resulted in even slower performance. I'd very much appreciate your help.
0
Comments
------------------------------
QUERY PLAN DESCRIPTION:
------------------------------
Opt Vertica Options
--------------------
PLAN_OUTPUT_SUPER_VERBOSE
EXPLAIN VERBOSE
SELECT relVarTable0.id AS id, relVarTable1.val, relVarTable2.val
FROM (SELECT id FROM rates) relVarTable0
LEFT JOIN
(SELECT rates1.id AS id, AVG(rates4.rating) AS val
FROM rates rates1, movie movie1, rates rates2, ml_user ml_user1, rates rates3, movie movie2, rates rates4
WHERE movie1.id = rates1.movie_id AND movie1.id = rates2.movie_id AND ml_user1.id = rates2.ml_user_id AND ml_user1.id = rates3.ml_user_id AND movie2.id = rates3.movie_id AND movie2.id = rates4.movie_id AND movie1.id <> movie2.id AND rates1.id <> rates2.id AND rates2.id <> rates3.id AND rates3.id <> rates4.id AND rates4.rating IS NOT NULL
GROUP BY rates1.id) relVarTable1
ON relVarTable0.id = relVarTable1.id
LEFT JOIN
(SELECT rates1.id AS id, rates1.rating AS val
FROM rates rates1
WHERE rates1.rating IS NOT NULL ) relVarTable2
ON relVarTable0.id = relVarTable2.id;
Access Path:
Sort Key: (V(1,1))
LDISTRIB_UNSEGMENTED
+-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4489.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5441368.000000 Memory(B): 1209184.000000 Netwrk(B): 1209184.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 1) Inner (RESEGMENT)
| Join Cond: (relVarTable0.id = relVarTable2.id)
| Execute on: All Nodes
| Sort Key: (V(1,1))
| LDISTRIB_UNSEGMENTED
| +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4197.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 1369200.000000 Memory(B): 0.000000 Netwrk(B): 604600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 2) Outer (RESEGMENT)
| | Join Cond: (relVarTable0.id = relVarTable1.id)
| | Execute on: All Nodes
| | Sort Key: (V(1,1))
| | LDISTRIB_UNSEGMENTED
| | +-- Outer -> SELECT [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 3)
| | | Execute on: All Nodes
| | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | LDISTRIB_UNSEGMENTED
| | | +---> STORAGE ACCESS for rates [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 4)
| | | | Column Cost Aspects: [ Disk(B): 196608.000000 CPU(B): 0.000000 Memory(B): 604600.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | Projection: movielens_test.rates_b0
| | | | Materialize: rates.id
| | | | Execute on: All Nodes
| | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | LDISTRIB_SEGMENTED
| | +-- Inner -> SELECT [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 5)
| | | Execute on: All Nodes
| | | Sort Key: (rates.id)
| | | LDISTRIB_UNSEGMENTED
| | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 6650600.000000 Memory(B): 640000.000000 Netwrk(B): 6890600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 6)
| | | | Aggregates: sum_float(<SVAR>), count(<SVAR>)
| | | | Group By: rates1.id
| | | | Execute on: All Nodes
| | | | Sort Key: (rates.id)
| | | | LDISTRIB_SEGMENTED
| | | | +---> JOIN HASH [Cost: 2869.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 12091944.000000 Memory(B): 3022960.000000 Netwrk(B): 1813776.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 88] (PATH ID: 7) Inner (RESEGMENT)
| | | | | Join Cond: (movie2.id = rates4.movie_id)
| | | | | Join Filter: (rates3.id <> rates4.id)
| | | | | Execute on: All Nodes
| | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | LDISTRIB_UNSEGMENTED
| | | | | +-- Outer -> JOIN HASH [Cost: 2395.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 9110592.000000 Memory(B): 41592.000000 Netwrk(B): 4246064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 64] (PATH ID: 8) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
| | | | | | Join Cond: (movie2.id = rates3.movie_id)
| | | | | | Join Filter: (movie1.id <> movie2.id)
| | | | | | Execute on: All Nodes
| | | | | | Runtime Filter: (SIP1(HashJoin): movie2.id)
| | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | LDISTRIB_SEGMENTED
| | | | | | +-- Outer -> JOIN HASH [Cost: 1625.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 10278200.000000 Memory(B): 3023000.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 56] (PATH ID: 9) Inner (RESEGMENT)
| | | | | | | Join Cond: (ml_user1.id = rates3.ml_user_id)
| | | | | | | Join Filter: (rates2.id <> rates3.id)
| | | | | | | Execute on: All Nodes
| | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | LDISTRIB_UNSEGMENTED
| | | | | | | +-- Outer -> JOIN HASH [Cost: 1163.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5582544.000000 Memory(B): 141144.000000 Netwrk(B): 2465448.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 10) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
| | | | | | | | Join Cond: (ml_user1.id = rates2.ml_user_id)
| | | | | | | | Execute on: All Nodes
| | | | | | | | Runtime Filter: (SIP2(HashJoin): ml_user1.id)
| | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | +-- Outer -> JOIN HASH [Cost: 711.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 8464400.000000 Memory(B): 2418400.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 32] (PATH ID: 11) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
| | | | | | | | | Join Cond: (movie1.id = rates2.movie_id)
| | | | | | | | | Join Filter: (rates1.id <> rates2.id)
| | | | | | | | | Execute on: All Nodes
| | | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | | +-- Outer -> STORAGE ACCESS for rates2 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 12)
| | | | | | | | | | Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | | | Projection: movielens_test.rates_b0
| | | | | | | | | | Materialize: rates2.id, rates2.ml_user_id, rates2.movie_id
| | | | | | | | | | Execute on: All Nodes
| | | | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | | +-- Inner -> JOIN HASH [Cost: 268.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 3064592.000000 Memory(B): 41592.000000 Netwrk(B): 1223064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 13) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
| | | | | | | | | | Join Cond: (movie1.id = rates1.movie_id)
| | | | | | | | | | Execute on: All Nodes
| | | | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates1 [Cost: 39.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 14)
| | | | | | | | | | | Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 0.000000 Memory(B): 1209200.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | | | | Projection: movielens_test.rates_b0
| | | | | | | | | | | Materialize: rates1.id, rates1.movie_id
| | | | | | | | | | | Execute on: All Nodes
| | | | | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | | | +-- Inner -> STORAGE ACCESS for movie1 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 15)
| | | | | | | | | | | Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | | | | Projection: movielens_test.movie_b0
| | | | | | | | | | | Materialize: movie1.id
| | | | | | | | | | | Execute on: All Nodes
| | | | | | | | | | | Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
| | | | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | | +-- Inner -> STORAGE ACCESS for ml_user1 [Cost: 5.000000, Rows: 5881.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 16)
| | | | | | | | | Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 47048.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | | Projection: movielens_test.ml_user_b0
| | | | | | | | | Materialize: ml_user1.id
| | | | | | | | | Execute on: All Nodes
| | | | | | | | | Sort Key: (ml_user.id, ml_user.gender, ml_user.age_range, ml_user.occupation, ml_user.zipcode, ml_user.num_ratings)
| | | | | | | | | LDISTRIB_SEGMENTED
| | | | | | | +-- Inner -> STORAGE ACCESS for rates3 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 17)
| | | | | | | | Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | | Projection: movielens_test.rates_b0
| | | | | | | | Materialize: rates3.id, rates3.ml_user_id, rates3.movie_id
| | | | | | | | Execute on: All Nodes
| | | | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | | | LDISTRIB_SEGMENTED
| | | | | | +-- Inner -> STORAGE ACCESS for movie2 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 18)
| | | | | | | Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | | Projection: movielens_test.movie_b0
| | | | | | | Materialize: movie2.id
| | | | | | | Execute on: All Nodes
| | | | | | | Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
| | | | | | | LDISTRIB_SEGMENTED
| | | | | +-- Inner -> STORAGE ACCESS for rates4 [Cost: 60.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PUSHED GROUPING) Partial GroupBy: rates4.movie_id,rates4.id Partial Aggs: sum_float(<SVAR>),count(<SVAR>) (PATH ID: 19)
| | | | | | Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 196608.000000 Memory(B): 1813784.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | | | | Projection: movielens_test.rates_b0
| | | | | | Materialize: rates4.rating, rates4.id, rates4.movie_id
| | | | | | Filter: (rates4.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
| | | | | | Execute on: All Nodes
| | | | | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | | | | LDISTRIB_SEGMENTED
| +-- Inner -> SELECT [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 20)
| | Execute on: All Nodes
| | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | LDISTRIB_UNSEGMENTED
| | +---> STORAGE ACCESS for rates1 [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 21)
| | | Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 196608.000000 Memory(B): 1209184.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
| | | Projection: movielens_test.rates_b0
| | | Materialize: rates1.rating, rates1.id
| | | Filter: (rates1.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
| | | Execute on: All Nodes
| | | Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
| | | LDISTRIB_SEGMENTED
------------------------------
There are so many variables you can look into - Am I correct that the tables/super-projection are Segmented? Try putting them on to one node with plenty of memory (8G+)
If you like to see which JOIN slows things down the most, maybe you can tweak with number of tables to join or the join condition, just to see if anything improves.
Lastly, if you have the resources, put this query into a ROW-based database, e.g. MySQL or Oracle (a la Amazon's RDS) Start an instance with enough memory to put the whole table in. It won't surprise me that the kind of query you are running may do well with a ROW-based DB that aggressively take advantage of memory allocation.
Hope this helps,
Bing
Just a brief comment. I see "NO STATISTICS" showing up very often in the explain. That's not helping. So, running analyze_statistics is going to help here.
Make sure the PK and FK relationships are defined. Vertica doesn't enforce them, but it will help the optimizer make good decisions.
I see RESEGMENT showing up frequently. That tells me the segmentation definition might not be consistent. Tables that are joined, which are segmented, should be segmented in the same way in order to ensure that the joins happen locally - and data isn't being moved around the cluster.
The "<>" are not helping. These are inefficient by design. If there was a way to remove those, that would likely speed it up a lot.
Hope that helps!
> Just a brief comment. I see "NO STATISTICS" showing up very often in the explain. That's not helping. So, running analyze_statistics is going to help here.
Yes, I looked at that and found that, at least for this small database, the planner's row estimates were perfect. Thus the performance with and without statisties was the same. Good call, though.
> Make sure the PK and FK relationships are defined. Vertica doesn't enforce them, but it will help the optimizer make good decisions.
I hadn't thought about that, but I just verified that they are defined.
> I see RESEGMENT showing up frequently. That tells me the segmentation definition might not be consistent. Tables that are joined, which are segmented, should be segmented in the same way in order to ensure that the joins happen locally - and data isn't being moved around the cluster.
Thanks for bringing this up - it is still confusing to me. I'm using the default segmentation that Vertica chooses when I create the database from scratch. I've put the 'SELECT export_objects' output at https://gist.github.com/anonymous/9891527 if you want to take a look at the CREATE PROJECTION statements. They are all 'SEGMENTED BY hash'. For the three tables in the query: I would /love/ to hear your suggestions on how to change this to be faster.
> The "<>" are not helping. These are inefficient by design. If there was a way to remove those, that would likely speed it up a lot.
Yes, I wish I could eliminate those (I read your docs about how this punts the planner into hashes), but it's the only way we know of to get this type of query to be correct. Like I tried to explain below, it's kind of misusing the database to walk along all 'paths' starting with rates ids and then calculating an aggregation one the last id. In this case: "[rates, movie, rates, ml_user, rates, movie, rates].rating". It's part of our machine learning research, which partly explains the oddball use of RDBMS's. Not sure that makes sense...
Back to your point, the inequalities ensure we don't reuse backtrack through ids we've seen before.
Thanks!