How to avoid flattening of subqueries in WHERE clauses?
rikusg
Hi there,
I have a question regarding the way that the query planner flattens subqueries in WHERE clauses to one or more JOINs.
Here is an example table:
CREATE TABLE geo(
uuid int NOT NULL,
type int NOT NULL,
cid int
)
ORDER BY geo.type, geo.cid
SEGMENTED BY hash(geo.uuid) ALL NODES KSAFE 1;
dbadmin=> SELECT type, count(*) FROM geo GROUP BY type;
type | count
------+------------
0 | 1500000000
1 | 1073741824
-1 | 426258176
(3 rows)
Let's filter using a highly selective subquery in a WHERE clause with an OR:
SELECT count(*) FROM geo
WHERE type=-1
OR type=1 AND cid IN (SELECT 2223); --Illustrative highly selective subquery
count
-----------
426258177
(1 row)
Time: First fetch (1 row): 6954.000 ms. All rows formatted: 6990.000 ms
This is surprisingly slow. Let's inline the subquery's result:
SELECT count(*) FROM geo
WHERE type=-1
OR type=1 AND cid IN (2223); --Inline the subquery's results
count
-----------
426258177
(1 row)
Time: First fetch (1 row): 329.000 ms. All rows formatted: 367.000 ms
Much better!
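In practice you can automate this inlining by resolving the subquery client-side before the main query runs. A minimal sketch, assuming vsql (which inherits psql-style client variables; the variable name `cid_filter` is made up for illustration):

```sql
-- Run the selective subquery first (here its known result is assigned
-- directly), then let vsql substitute it as a literal so the planner
-- only ever sees an inline IN-list:
\set cid_filter 2223
SELECT count(*) FROM geo
WHERE type=-1
   OR type=1 AND cid IN (:cid_filter);
```

The same two-step pattern works from any client driver: execute the selective subquery on its own, then build the main statement with the returned values inlined as literals.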
How about adding another OR with an additional subquery?
SELECT count(*) FROM geo
WHERE type=-1
OR type=1 AND cid IN (SELECT 2223) --Illustrative highly selective subquery
OR type=0 AND cid IN (SELECT 1038); --Illustrative highly selective subquery
count
-----------
426258178
(1 row)
Time: First fetch (1 row): 15351.000 ms. All rows formatted: 15388.000 ms
Now it is even slower. So what is going on here? Let's look at the query plan:
Access Path:
+-GROUPBY NOTHING [Cost: 3M, Rows: 1] (PATH ID: 1)
| Aggregates: count(*)
| Execute on: All Nodes
| +---> JOIN HASH [Semi] [Cost: 3M, Rows: 750M] (PATH ID: 2) Inner (BROADCAST)
| | Join Cond: (geo.cid = VAL(3))
| | Materialize at Input: geo.cid
| | Materialize at Output: geo.type
| | Filter: ((geo.type = (-1)) OR ((geo.type = 1) AND VAL(2)) OR ((geo.type = 0) AND VAL(3)))
| | Execute on: All Nodes
| | +-- Outer -> JOIN HASH [Semi] [Cost: 2M, Rows: 2B] (PATH ID: 3) Inner (BROADCAST)
| | | Join Cond: (geo.cid = VAL(2))
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for geo [Cost: 1M, Rows: 3B] (PATH ID: 4)
| | | | Projection: public.geop_b0
| | | | Materialize: geo.cid
| | | | Execute on: All Nodes
| | | +-- Inner -> SELECT [Cost: 10, Rows: 1] (PATH ID: 5)
| | | | Execute on: Query Initiator
| | | | +---> STORAGE ACCESS for dual [Cost: 10, Rows: 1] (PATH ID: 6)
| | | | | Projection: v_catalog.dual_p
| | | | | Execute on: Query Initiator
| | +-- Inner -> SELECT [Cost: 10, Rows: 1] (PATH ID: 7)
| | | Execute on: Query Initiator
| | | +---> STORAGE ACCESS for dual [Cost: 10, Rows: 1] (PATH ID: 8)
| | | | Projection: v_catalog.dual_p
| | | | Execute on: Query Initiator
It looks like the planner flattened each IN subquery into a hash semi join, with the subquery's result broadcast to all nodes, so the predicate is evaluated through a chain of joins over the full table rather than as a simple storage-level filter.
Let's look at the query plan for an inline case:
Access Path:
+-GROUPBY NOTHING [Cost: 3M, Rows: 1] (PATH ID: 1)
| Aggregates: count(*)
| Execute on: All Nodes
| +---> STORAGE ACCESS for geo [Cost: 3M, Rows: 1B] (PATH ID: 2)
| | Projection: public.geop_b0
| | Materialize: geo.type, geo.cid
| | Filter: ((geo.type = (-1)) OR ((geo.type = 1) AND (geo.cid = 2223)) OR ((geo.type = 0) AND (geo.cid = 1038)))
| | Execute on: All Nodes
Much better!
So the question is: how can this flattening behavior be avoided? It is clearly not a good idea when the subqueries are highly selective.
Any help much appreciated!
Comments
Great question; I just ran into the same issue. Getting an answer would be great.
Do you have "SELECT 2223" as a subquery in your real application, or is it actually a query that can return a set of values? This specific example can be optimized, but I guess your real use case could be different.
I can open an enhancement request for this specific example.
I think a common use case would be keeping offsets in a table for processing rows. These would often be timestamps or integers. You would then want to query the offset table to build the next WHERE clause.
Possible feature enhancements:
So we managed to get past this with a really ugly hack: our Java code creates a scalar function that the query then uses. This has a huge impact on performance; some of our ETL jobs went from 3 hours down to 20 minutes after this change.
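For reference, such a workaround can be sketched in SQL alone. This is an illustration, not the poster's actual code: the function name `current_cid` is invented, and it assumes Vertica expands user-defined SQL functions inline at plan time, so the predicate becomes a constant comparison instead of a subquery semi join:

```sql
-- Hypothetical helper: the ETL job recreates it with the current
-- offset value before each run, so queries referencing it get the
-- value expanded as a literal rather than planned as a subquery.
CREATE OR REPLACE FUNCTION current_cid() RETURN int
AS BEGIN
    RETURN 2223;
END;

SELECT count(*) FROM geo
WHERE type=-1
   OR type=1 AND cid = current_cid();
```

The trade-off is that the function must be redefined whenever the offset changes, which is why it feels like a hack, but it keeps the fast filter-only plan.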