Loads into pre-join projections of 2 large tables don't scale

Hello, We are trying to create a pre-join projection that joins two large tables. The problem is that INSERTs are slow: they take O(existing_data), not O(added_data). In practice, this means that after importing a couple of days worth of data, they become prohibitively slow. The projections are written so that: - INSERT uses MERGE JOINs only. - Data for INSERT is presorted, i.e. no (SORT ON JOIN KEY). - When running on a multi-node cluster, there is no (RESEGMENT). - There's an extra batch_id column, that serves as a bucket, as per this help page. The following script demonstrates the issue: https://dl.dropbox.com/u/41764/vertica/prejoin-scaling/vertica-prejoin-scaling-test.sh The script will:
   1. create a schema with 2 tables, 'a' and 'b'   2. create "normal" projections for 'a' and prejoin for 'b' x 'a'   3. repeat 100 times       3.1. load 1M rows into 'a_temp'       3.2. load 1M rows into 'b_temp'       3.3. insert into a select * from a_temp       3.4. insert into b select * from b_temp  
You will see that the times for #3.3 are constant, but the times for #3.4 grow with the number of *existing*, not *inserted* rows. Here is a chart: imageimage Doing a SELECT .. JOIN .. for what I think Vertica has to do when populating the projections *is fast*, but INSERT .. SELECT is slow. My interpretation is that Vertica does not use all the information available (namely, the WHERE clause) when constructing the plan for the implicit JOIN. Instead of looking only through a part of the joined table, it looks through all of it. More on this here. The issue has been acknowledged as VER-25968. I was wondering whether there are any plans to fix this in the future? Best, Jaka

Comments

  • Hi Jaka, Ah, I think I do see what's going on here. It's not quite so simple as information being missed from the WHERE clause at planning time (because at planning time we don't know what values you're going to load and that knowledge is what really speeds things up), but as you can see from running other queries, certainly there are ways to take more information into account. (You compare the query plan to inserting into a table where a value is equal to a particular constant. That's cheating; you're telling us the value at plan-time, of course we can write a plan that makes that fast :-) A better comparison might be where a value is IN a subquery that fetches a few rows from a temp table.) Unfortunately, I'm not in a position to comment on future plans / release dates / etc... That would have to come through Support. (Or any other higher-ups who you might meet at conferences/events/etc -- we have lots to do here but we do try to reach out at least occasionally :-) ) It is a good catch, though. Thanks for investigating it so thoroughly! Adam
  • Haha, well, I wouldn't call it cheating. Maybe... hinting? :) In fact, it does work for a similar case: If you do INSERT .. SELECT .. ORDER BY, the ordering information is used to optimize the implicit join. I'm just hoping for the same behaviour for INSERT .. SELECT .. WHERE. Adding either ORDER BY or WHERE may or may not (e.g. if it was already sorted) change the inserted data, but it does guarantee that whatever is being inserted is either ordered or matches a predicate.
  • Now that Vertica 7 has been announced, I'm happy to report that we've been testing this with an early build and it has been fixed:

    image

    image

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file