Optimizing join statements

Dror_Yosef · September 2013

Hello, Can someone elaborate about this topic from the guide "To improve the performance of queries that perform joins, make sure that HP Vertica chooses the larger table as the outer (left hand) input by ensuring that any applicable constraints are defined." I have a query that select the small table as inner table when i do count(*) but when i select fields from both tables the optimizer swapping the order Thanks

[Deleted User] · September 2013

Hi Dror, I don't have a specific answer. But I know the first question that people who are more knowledgeable than me are likely to ask, is, "could you post the EXPLAIN output for your query?" The second question would be, "have you collected up-to-date statistics on these projections?" If you have very stale statistics, this can give Vertica incorrect information about each table, in which case it can make bad decisions about join order. On the flip side, Vertica has a variety of clever optimizations that can, among other things, affect the effective size of a table as far as the JOIN is concerned. It's generally possible to figure out if that's what's going on based on the EXPLAIN output. Adam

Dror_Yosef · September 2013

Hi Adam, Thanks for you answer. I have collected statistics using ANALYZE_STATISTICS for both tables Here is the sql + explain output for it. BigTable has about 1B rows and the smallTable 5M This query runs for more than 2 minutes on 3 nodes cluster, way over any other query i do including join queries. both tables segmented by key1 (has many values) Is there a way to improve times? i ran DBD and it didn't improve timing Thanks in advance SELECT A.attr1, A.attr2 , A.attr3 , A.key1 , A.key2 , A.eDate , A.eDateHour , A.key3 ,CASE WHEN C.cnt IS NULL THEN 1 ELSE 0 END newAttr1 ,CASE WHEN D.cnt IS NULL THEN 1 ELSE 0 END newAttr2 ,A.isCombinedSdk FROM staging.smallTable A LEFT JOIN bigTable B ON A.key1 = B.key1 AND A.key2 = A.key2 AND A.key3 = B.key3 LEFT JOIN (SELECT key1,key2,COUNT(*) cnt FROM bigTable GROUP BY key1,key2) C ON A.key1 = C.key1 AND A.key2 = A.key2 LEFT JOIN (SELECT key1,COUNT(*) cnt FROM bigTable GROUP BY key1) D ON A.key1 = D.key1 WHERE B.key1 IS NULL; Access Path: +-SELECT [Cost: 11M, Rows: 100] (PATH ID: 0) | Execute on: Query Initiator | +---> JOIN HASH [RightOuter] [Cost: 11M, Rows: 1B] (PATH ID: 1) | | Join Cond: (A.key1 = C.key1) | | Join Filter: (A.key2 = A.key2) | | Filter: (B.key1 IS NULL) | | Execute on: All Nodes | | +-- Outer -> SELECT [Cost: 6M, Rows: 1B] (PATH ID: 2) | | | Execute on: All Nodes | | | +---> GROUPBY HASH (LOCAL RESEGMENT GROUPS) [Cost: 6M, Rows: 1B] (PATH ID: 3) | | | | Aggregates: count(*) | | | | Group By: bigTable.key1, bigTable.key2 | | | | Execute on: All Nodes | | | | +---> STORAGE ACCESS for bigTable [Cost: 900K, Rows: 1B] (PATH ID: 4) | | | | | Projection: public.bigTable_q2_b0 | | | | | Materialize: bigTable.key1, bigTable.key2 | | | | | Execute on: All Nodes | | | | | Runtime Filter: (SIP1(HashJoin): C.key1) | | +-- Inner -> JOIN MERGEJOIN(inputs presorted) [RightOuter] [Cost: 4M, Rows: 949K] (PATH ID: 5) | | | Join Cond: (A.key1 = B.key1) AND (A.key3 = B.key3) | | | Join Filter: (A.key2 = A.key2) | | | Materialize at Input: A.key1, A.key3, A.Attr1, A.Attr2, A.key2, A.Attr3 | | | Execute on: All Nodes | | | +-- Outer -> STORAGE ACCESS for B [Cost: 2M, Rows: 1B] (PATH ID: 6) | | | | Projection: public.bigTable_DBD_q7_1_b0 | | | | Materialize: B.key1, B.key3 | | | | Execute on: All Nodes | | | | Runtime Filters: (SIP2(MergeJoin): B.key1), (SIP3(MergeJoin): B.key3), (SIP4(MergeJoin): B.key1, B.key3) | | | +-- Inner -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 1M, Rows: 949K] (PATH ID: 7) | | | | Join Cond: (A.key1 = D.key1) | | | | Execute on: All Nodes | | | | +-- Outer -> STORAGE ACCESS for A [Cost: 748, Rows: 949K] (PATH ID: 8) | | | | | Projection: staging.smallTable_b0 | | | | | Materialize: A.key1 | | | | | Execute on: All Nodes | | | | +-- Inner -> SELECT [Cost: 1M, Rows: 1B] (PATH ID: 9) | | | | | Execute on: All Nodes | | | | | +---> GROUPBY PIPELINED [Cost: 1M, Rows: 1B] (PATH ID: 10) | | | | | | Aggregates: count(*) | | | | | | Group By: bigTable.key1 | | | | | | Execute on: All Nodes | | | | | | +---> STORAGE ACCESS for bigTable [Cost: 667K, Rows: 1B] (PATH ID: 11) | | | | | | | Projection: public.bigTable_q3_seg_b0 | | | | | | | Materialize: bigTable.key1 | | | | | | | Execute on: All Nodes

Sharon_Cutter · September 2013

Are these join conditions intentional or a typo? "A.key2 = A.key2"

Dror_Yosef · September 2013

Sorry, those are typo when i changed the original names should be A.key2 = B.key2 and A.key2 = C.key2 respectively

Sharon_Cutter · September 2013

How's the performance / explain plan with that change?

Dror_Yosef · September 2013

Hi Sharon, I added unsegmented projections recommended by DBD, still the query runs in about 2 minutes. below is the query plan Thanks for you help SELECT A.attr1, A.attr2 , A.attr3 , A.key1 , A.key2 , A.key3 ,CASE WHEN C.cnt IS NULL THEN 1 ELSE 0 END newAttr1 ,CASE WHEN D.cnt IS NULL THEN 1 ELSE 0 END newAttr2 FROM staging.smallTable A LEFT JOIN bigTable B ON A.key1 = B.key1 AND A.key2 = B.key2 AND A.key3 = B.key3 LEFT JOIN (SELECT key1,key2,COUNT(*) cnt FROM bigTable GROUP BY key1,key2) C ON A.key1 = C.key1 AND A.key2 = C.key2 LEFT JOIN (SELECT key1,COUNT(*) cnt FROM bigTable GROUP BY key1) D ON A.key1 = D.key1 WHERE B.key1 IS NULL; ; Access Path: +-JOIN HASH [RightOuter] [Cost: 6M, Rows: 1M] (PATH ID: 1) | Join Cond: (A.key1 = B.key1) AND (A.key2 = B.key2) AND (A.key3 = B.key3) | Materialize at Input: A.key1, A.key2, A.key3, A.attr1, A.attr2, A.attr3 | Filter: (B.key1 IS NULL) | Execute on: All Nodes | +-- Outer -> STORAGE ACCESS for B [Cost: 2M, Rows: 1B] (PATH ID: 2) | | Projection: public.bigTable_DBD_q8_b0 | | Materialize: B.key1, B.key3, B.key2 | | Execute on: All Nodes | | Runtime Filters: (SIP1(HashJoin): B.key1), (SIP2(HashJoin): B.key2), (SIP3(HashJoin): B.key3), (SIP4(HashJoin): B.key1, B.key2, B.key3) | +-- Inner -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 3M, Rows: 1M] (PATH ID: 3) | | Join Cond: (A.key1 = C.key1) AND (A.key2 = C.key2) | | Materialize at Input: A.key1, A.key2 | | Execute on: All Nodes | | +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 1M, Rows: 1M] (PATH ID: 4) Outer (FILTER) | | | Join Cond: (A.key1 = D.key1) | | | Execute on: All Nodes | | | +-- Outer -> STORAGE ACCESS for A [Cost: 2K, Rows: 1M] (PATH ID: 5) | | | | Projection: staging.smallTable_DBD_US1_node0001 | | | | Materialize: A.key1 | | | | Execute on: All Nodes | | | +-- Inner -> SELECT [Cost: 1M, Rows: 1B] (PATH ID: 6) | | | | Execute on: All Nodes | | | | +---> GROUPBY PIPELINED [Cost: 1M, Rows: 1B] (PATH ID: 7) | | | | | Aggregates: count(*) | | | | | Group By: bigTable.key1 | | | | | Execute on: All Nodes | | | | | +---> STORAGE ACCESS for bigTable [Cost: 673K, Rows: 1B] (PATH ID: 8) | | | | | | Projection: public.bigTable_q3_seg_b0 | | | | | | Materialize: bigTable.key1 | | | | | | Execute on: All Nodes | | +-- Inner -> SELECT [Cost: 2M, Rows: 1B] (PATH ID: 9) | | | Execute on: All Nodes | | | +---> GROUPBY PIPELINED [Cost: 2M, Rows: 1B] (PATH ID: 10) | | | | Aggregates: count(*) | | | | Group By: bigTable.key1, bigTable.key2 | | | | Execute on: All Nodes | | | | +---> STORAGE ACCESS for bigTable [Cost: 904K, Rows: 1B] (PATH ID: 11) | | | | | Projection: public.bigTable_DBD_1_seg_b0 | | | | | Materialize: bigTable.key1, bigTable.key2 | | | | | Execute on: All Nodes

Wils_Gomes · September 2013

I have a similar issue, basically I have a COUNT() to create a pagination for my application. select count(1) as total_rows from (subquery with some sum() and 2 inner joins) as as total_sums After passing the query to DBD to create a Projection the query is now 5 times slower.

We're Moving!

Create My New Community Account Now

Optimizing join statements

Comments

Leave a Comment