Options

slow multi-join query for small database (~120K total rows, 10 minutes)

Hi Folks. I'd love to get your help understanding why this join-heavy query takes about 10 minutes to run on a small database of seven tables totaling < 120K rows, and ideally get your suggestions on how to make it faster on our little cluster of four nodes. I've put supporting information at https://gist.github.com/anonymous/8862796 (list of tables, list of fields by table, and table sizes), but following is the query and the EXPLAIN VERBOSE output. I ran ANALYZE_WORKLOAD() on this query and then followed its suggestion to run ANALYZE_STATISTICS on all tables. This resulted in no improvement. I then did its second suggestion of running the Database Designer, which resulted in even slower performance. I'd very much appreciate your help.


Comments

  • Options
    Here's the EXPLAIN:

     ------------------------------
     QUERY PLAN DESCRIPTION:
     ------------------------------
     
     Opt Vertica Options
     --------------------
     PLAN_OUTPUT_SUPER_VERBOSE
     
     
     EXPLAIN VERBOSE
     SELECT relVarTable0.id AS id, relVarTable1.val, relVarTable2.val
     FROM (SELECT id FROM rates) relVarTable0
     LEFT JOIN
     (SELECT rates1.id AS id, AVG(rates4.rating) AS val
     FROM rates rates1, movie movie1, rates rates2, ml_user ml_user1, rates rates3, movie movie2, rates rates4
     WHERE movie1.id = rates1.movie_id AND movie1.id = rates2.movie_id AND ml_user1.id = rates2.ml_user_id AND ml_user1.id = rates3.ml_user_id AND movie2.id = rates3.movie_id AND movie2.id = rates4.movie_id AND movie1.id <> movie2.id AND rates1.id <> rates2.id AND rates2.id <> rates3.id AND rates3.id <> rates4.id AND rates4.rating IS NOT NULL
     GROUP BY rates1.id) relVarTable1
      ON relVarTable0.id = relVarTable1.id
     LEFT JOIN
     (SELECT rates1.id AS id, rates1.rating AS val
     FROM rates rates1
     WHERE rates1.rating IS NOT NULL ) relVarTable2
      ON relVarTable0.id = relVarTable2.id;
     
     Access Path:
     Sort Key: (V(1,1))
      LDISTRIB_UNSEGMENTED
     +-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4489.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5441368.000000 Memory(B): 1209184.000000 Netwrk(B): 1209184.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 1) Inner (RESEGMENT)
     |  Join Cond: (relVarTable0.id = relVarTable2.id)
     |  Execute on: All Nodes
     |  Sort Key: (V(1,1))
     |  LDISTRIB_UNSEGMENTED
     | +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4197.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 1369200.000000 Memory(B): 0.000000 Netwrk(B): 604600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 2) Outer (RESEGMENT)
     | |      Join Cond: (relVarTable0.id = relVarTable1.id)
     | |      Execute on: All Nodes
     | |      Sort Key: (V(1,1))
     | |      LDISTRIB_UNSEGMENTED
     | | +-- Outer -> SELECT [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 3)
     | | |      Execute on: All Nodes
     | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | |      LDISTRIB_UNSEGMENTED
     | | | +---> STORAGE ACCESS for rates [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 4)
     | | | |      Column Cost Aspects: [ Disk(B): 196608.000000 CPU(B): 0.000000 Memory(B): 604600.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
     | | | |      Projection: movielens_test.rates_b0
     | | | |      Materialize: rates.id
     | | | |      Execute on: All Nodes
     | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | |      LDISTRIB_SEGMENTED
     | | +-- Inner -> SELECT [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 5)
     | | |      Execute on: All Nodes
     | | |      Sort Key: (rates.id)
     | | |      LDISTRIB_UNSEGMENTED
     | | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 6650600.000000 Memory(B): 640000.000000 Netwrk(B): 6890600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 6)
     | | | |      Aggregates: sum_float(<SVAR>), count(<SVAR>)
     | | | |      Group By: rates1.id
     | | | |      Execute on: All Nodes
     | | | |      Sort Key: (rates.id)
     | | | |      LDISTRIB_SEGMENTED
     | | | | +---> JOIN HASH [Cost: 2869.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 12091944.000000 Memory(B): 3022960.000000 Netwrk(B): 1813776.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 88] (PATH ID: 7) Inner (RESEGMENT)
     | | | | |      Join Cond: (movie2.id = rates4.movie_id)
     | | | | |      Join Filter: (rates3.id <> rates4.id)
     | | | | |      Execute on: All Nodes
     | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | |      LDISTRIB_UNSEGMENTED
     | | | | | +-- Outer -> JOIN HASH [Cost: 2395.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 9110592.000000 Memory(B): 41592.000000 Netwrk(B): 4246064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 64] (PATH ID: 8) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
     | | | | | |      Join Cond: (movie2.id = rates3.movie_id)
     | | | | | |      Join Filter: (movie1.id <> movie2.id)
     | | | | | |      Execute on: All Nodes
     | | | | | |      Runtime Filter: (SIP1(HashJoin): movie2.id)
     | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | | |      LDISTRIB_SEGMENTED
     | | | | | | +-- Outer -> JOIN HASH [Cost: 1625.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 10278200.000000 Memory(B): 3023000.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 56] (PATH ID: 9) Inner (RESEGMENT)
     | | | | | | |      Join Cond: (ml_user1.id = rates3.ml_user_id)
     | | | | | | |      Join Filter: (rates2.id <> rates3.id)
     | | | | | | |      Execute on: All Nodes
     | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | | | |      LDISTRIB_UNSEGMENTED
     | | | | | | | +-- Outer -> JOIN HASH [Cost: 1163.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5582544.000000 Memory(B): 141144.000000 Netwrk(B): 2465448.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 10) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
     | | | | | | | |      Join Cond: (ml_user1.id = rates2.ml_user_id)
     | | | | | | | |      Execute on: All Nodes
     | | | | | | | |      Runtime Filter: (SIP2(HashJoin): ml_user1.id)
     | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | | | | |      LDISTRIB_SEGMENTED
     | | | | | | | | +-- Outer -> JOIN HASH [Cost: 711.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 8464400.000000 Memory(B): 2418400.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 32] (PATH ID: 11) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
     | | | | | | | | |      Join Cond: (movie1.id = rates2.movie_id)
     | | | | | | | | |      Join Filter: (rates1.id <> rates2.id)
     | | | | | | | | |      Execute on: All Nodes
     | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | | | | | |      LDISTRIB_SEGMENTED
     | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates2 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 12)
     | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
     | | | | | | | | | |      Projection: movielens_test.rates_b0
     | | | | | | | | | |      Materialize: rates2.id, rates2.ml_user_id, rates2.movie_id
     | | | | | | | | | |      Execute on: All Nodes
     | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | | | | | | |      LDISTRIB_SEGMENTED
     | | | | | | | | | +-- Inner -> JOIN HASH [Cost: 268.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 3064592.000000 Memory(B): 41592.000000 Netwrk(B): 1223064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 13) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
     | | | | | | | | | |      Join Cond: (movie1.id = rates1.movie_id)
     | | | | | | | | | |      Execute on: All Nodes
     | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | | | | | | |      LDISTRIB_SEGMENTED
     | | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates1 [Cost: 39.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 14)
     | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 0.000000 Memory(B): 1209200.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
     | | | | | | | | | | |      Projection: movielens_test.rates_b0
     | | | | | | | | | | |      Materialize: rates1.id, rates1.movie_id
     | | | | | | | | | | |      Execute on: All Nodes
     | | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | | | | | | | |      LDISTRIB_SEGMENTED
     | | | | | | | | | | +-- Inner -> STORAGE ACCESS for movie1 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 15)
     | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
     | | | | | | | | | | |      Projection: movielens_test.movie_b0
     | | | | | | | | | | |      Materialize: movie1.id
     | | | | | | | | | | |      Execute on: All Nodes
     | | | | | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
     | | | | | | | | | | |      LDISTRIB_SEGMENTED
     | | | | | | | | +-- Inner -> STORAGE ACCESS for ml_user1 [Cost: 5.000000, Rows: 5881.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 16)
     | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 47048.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
     | | | | | | | | |      Projection: movielens_test.ml_user_b0
     | | | | | | | | |      Materialize: ml_user1.id
     | | | | | | | | |      Execute on: All Nodes
     | | | | | | | | |      Sort Key: (ml_user.id, ml_user.gender, ml_user.age_range, ml_user.occupation, ml_user.zipcode, ml_user.num_ratings)
     | | | | | | | | |      LDISTRIB_SEGMENTED
     | | | | | | | +-- Inner -> STORAGE ACCESS for rates3 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 17)
     | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
     | | | | | | | |      Projection: movielens_test.rates_b0
     | | | | | | | |      Materialize: rates3.id, rates3.ml_user_id, rates3.movie_id
     | | | | | | | |      Execute on: All Nodes
     | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | | | | |      LDISTRIB_SEGMENTED
     | | | | | | +-- Inner -> STORAGE ACCESS for movie2 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 18)
     | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
     | | | | | | |      Projection: movielens_test.movie_b0
     | | | | | | |      Materialize: movie2.id
     | | | | | | |      Execute on: All Nodes
     | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
     | | | | | | |      LDISTRIB_SEGMENTED
     | | | | | +-- Inner -> STORAGE ACCESS for rates4 [Cost: 60.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PUSHED GROUPING) Partial GroupBy: rates4.movie_id,rates4.id Partial Aggs: sum_float(<SVAR>),count(<SVAR>) (PATH ID: 19)
     | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 196608.000000 Memory(B): 1813784.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
     | | | | | |      Projection: movielens_test.rates_b0
     | | | | | |      Materialize: rates4.rating, rates4.id, rates4.movie_id
     | | | | | |      Filter: (rates4.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
     | | | | | |      Execute on: All Nodes
     | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | | | | |      LDISTRIB_SEGMENTED
     | +-- Inner -> SELECT [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 20)
     | |      Execute on: All Nodes
     | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | |      LDISTRIB_UNSEGMENTED
     | | +---> STORAGE ACCESS for rates1 [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 21)
     | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 196608.000000 Memory(B): 1209184.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
     | | |      Projection: movielens_test.rates_b0
     | | |      Materialize: rates1.rating, rates1.id
     | | |      Filter: (rates1.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
     | | |      Execute on: All Nodes
     | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
     | | |      LDISTRIB_SEGMENTED
     
     
     ------------------------------




  • Options
    I learned about PROFILE, ran it, and put the results here: https://gist.github.com/anonymous/8935190 . It is 8K lines long. Please tell me: Where do I start analyzing it?
  • Options
    For posterity; I would begin by sorting on the counter_value column. You'll quickly see that the JOIN's are the costliest operations. You may want to explore pre-join projections and join predicates or consider de-normalizing your schema (if possible).
  • Options
    I am not good enough to give you an indepth explanation of what happened to your query in Vertica. But just looking at your query, I can tell it is going to be very expensive in any database. For example, the query self-joins Rates table four times and Movies twice. And some of the Join conditions are Not Equals. Even though individual tables is only in the 100k range, joining them like this does not lead to 100K scan-based performance (e.g. group by)

    There are so many variables you can look into - Am I correct that the tables/super-projection are Segmented? Try putting them on to one node with plenty of memory (8G+) 

    If you like to see which JOIN slows things down the most, maybe you can tweak with number of tables to join or the join condition, just to see if anything improves.

    Lastly, if you have the resources, put this query into a ROW-based database, e.g. MySQL or Oracle (a la Amazon's RDS) Start an instance with enough memory to put the whole table in. It won't surprise me that the kind of query you are running may do well with a ROW-based DB that aggressively take advantage of memory allocation.

    Hope this helps,

    Bing
  • Options
    Thanks for this, Bing. I agree the query is expensive - it's misusing the database to walk along a path through instances: user -> rates -> movie -> rated by -> user, etc. The reason we're using Vertica is because PostgreSQL never finished this query, which takes 10 minutes on Vertica and actually completes. I'll think about your suggestion of putting all projections on one node.
  • Options

    Just a brief comment.  I see "NO STATISTICS" showing up very often in the explain. That's not helping. So, running analyze_statistics is going to help here.

    Make sure the PK and FK relationships are defined. Vertica doesn't enforce them, but it will help the optimizer make good decisions.

    I see RESEGMENT showing up frequently. That tells me the segmentation definition might not be consistent. Tables that are joined, which are segmented, should be segmented in the same way in order to ensure that the joins happen locally - and data isn't being moved around the cluster.

    The "<>" are not helping. These are inefficient by design. If there was a way to remove those, that would likely speed it up a lot.

    Hope that helps!

  • Options
    Hi Curtis. Thanks very much for your comment. Replies in-line:

    > Just a brief comment.  I see "NO STATISTICS" showing up very often in the explain. That's not helping. So, running analyze_statistics is going to help here.

    Yes, I looked at that and found that, at least for this small database, the planner's row estimates were perfect. Thus the performance with and without statisties was the same. Good call, though.


    > Make sure the PK and FK relationships are defined. Vertica doesn't enforce them, but it will help the optimizer make good decisions.

    I hadn't thought about that, but I just verified that they are defined.


    > I see RESEGMENT showing up frequently. That tells me the segmentation definition might not be consistent. Tables that are joined, which are segmented, should be segmented in the same way in order to ensure that the joins happen locally - and data isn't being moved around the cluster.

    Thanks for bringing this up - it is still confusing to me. I'm using the default segmentation that Vertica chooses when I create the database from scratch. I've put the 'SELECT export_objects' output at https://gist.github.com/anonymous/9891527 if you want to take a look at the CREATE PROJECTION statements. They are all 'SEGMENTED BY hash'. For the three tables in the query:
    CREATE PROJECTION movielens_test.ml_user /*+createtype(L)*/
    ...
    SEGMENTED BY hash(ml_user.id, ml_user.gender, ml_user.age_range, ml_user.occupation, ml_user.zipcode, ml_user.num_ratings) ALL NODES KSAFE 1;

    CREATE PROJECTION movielens_test.movie /*+createtype(L)*/
    ...
    SEGMENTED BY hash(movie.id, movie.year, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings, movie.title, movie.imdb_id, movie.rotten_tomatoes_id) ALL NODES KSAFE 1;

    CREATE PROJECTION movielens_test.rates /*+createtype(L)*/
    ...
    SEGMENTED BY hash(rates.id, rates.ml_user_id, rates.movie_id, rates.rating) ALL NODES KSAFE 1;
    I would /love/ to hear your suggestions on how to change this to be faster.


    > The "<>" are not helping. These are inefficient by design. If there was a way to remove those, that would likely speed it up a lot.

    Yes, I wish I could eliminate those (I read your docs about how this punts the planner into hashes), but it's the only way we know of to get this type of query to be correct. Like I tried to explain below, it's kind of misusing the database to walk along all 'paths' starting with rates ids and then calculating an aggregation one the last id. In this case: "[rates, movie, rates, ml_user, rates, movie, rates].rating". It's part of our machine learning research, which partly explains the oddball use of RDBMS's. Not sure that makes sense...

    Back to your point, the inequalities ensure we don't reuse backtrack through ids we've seen before.


    Thanks!


Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file