Loads into pre-join projections of 2 large tables don't scale
Hello, We are trying to create a pre-join projection that joins two large tables. The problem is that INSERTs are slow: they take O(existing_data), not O(added_data). In practice, this means that after importing a couple of days worth of data, they become prohibitively slow. The projections are written so that: - INSERT uses MERGE JOINs only. - Data for INSERT is presorted, i.e. no (SORT ON JOIN KEY). - When running on a multi-node cluster, there is no (RESEGMENT). - There's an extra batch_id column, that serves as a bucket, as per this help page. The following script demonstrates the issue: https://dl.dropbox.com/u/41764/vertica/prejoin-scaling/vertica-prejoin-scaling-test.sh The script will:
1. create a schema with 2 tables, 'a' and 'b' 2. create "normal" projections for 'a' and prejoin for 'b' x 'a' 3. repeat 100 times 3.1. load 1M rows into 'a_temp' 3.2. load 1M rows into 'b_temp' 3.3. insert into a select * from a_temp 3.4. insert into b select * from b_tempYou will see that the times for #3.3 are constant, but the times for #3.4 grow with the number of *existing*, not *inserted* rows. Here is a chart: Doing a SELECT .. JOIN .. for what I think Vertica has to do when populating the projections *is fast*, but INSERT .. SELECT is slow. My interpretation is that Vertica does not use all the information available (namely, the WHERE clause) when constructing the plan for the implicit JOIN. Instead of looking only through a part of the joined table, it looks through all of it. More on this here. The issue has been acknowledged as VER-25968. I was wondering whether there are any plans to fix this in the future? Best, Jaka
0
Comments