We're Moving!

The Vertica Forum is moving to a new OpenText Analytics Database (Vertica) Community.

Join us there to post discussion topics, learn about

product releases, share tips, access the blog, and much more.

Create My New Community Account Now


How to avoid flattening of subqueries in WHERE clauses? — Vertica Forum

How to avoid flattening of subqueries in WHERE clauses?

Hi there,

 

I have a question regarding the way that the query planner flattens subqueries in WHERE clauses to one or more JOINs.

 

Here is an example table:

CREATE TABLE geo(
uuid int NOT NULL,
type int NOT NULL,
cid int
)
ORDER BY geo.type, geo.cid
SEGMENTED BY hash(geo.uuid) ALL NODES KSAFE 1;

dbadmin=> SELECT type, count(*) FROM geo GROUP BY type;
type | count
------+------------
0 | 1500000000
1 | 1073741824
-1 | 426258176
(3 rows)

Let's filter using a highly selective subquery in a WHERE clause with an OR:

SELECT count(*) FROM geo
WHERE type=-1
OR type=1 AND cid IN (SELECT 2223); --Illustrative highly selective subquery
count
-----------
426258177
(1 row)
Time: First fetch (1 row): 6954.000 ms. All rows formatted: 6990.000 ms

This is suprising slow. Let's inline the subquery's result:

SELECT count(*) FROM geo
WHERE type=-1
OR type=1 AND cid IN (2223); --Inline the subquery's results
count
-----------
426258177
(1 row)
Time: First fetch (1 row): 329.000 ms. All rows formatted: 367.000 ms

Much better!

 

How about adding another OR with an additional subquery?

SELECT count(*) FROM geo
WHERE type=-1
OR type=1 AND cid IN (SELECT 2223) --Illustrative highly selective subquery
OR type=0 AND cid IN (SELECT 1038); --Illustrative highly selective subquery
count
-----------
426258178
(1 row)
Time: First fetch (1 row): 15351.000 ms. All rows formatted: 15388.000 ms

Now it is even slower. So what is going on here? Let's look at the query plan:

Access Path:
+-GROUPBY NOTHING [Cost: 3M, Rows: 1] (PATH ID: 1)
| Aggregates: count(*)
| Execute on: All Nodes
| +---> JOIN HASH [Semi] [Cost: 3M, Rows: 750M] (PATH ID: 2) Inner (BROADCAST)
| | Join Cond: (geo.cid = VAL(3))
| | Materialize at Input: geo.cid
| | Materialize at Output: geo.type
| | Filter: ((geo.type = (-1)) OR ((geo.type = 1) AND VAL(2)) OR ((geo.type = 0) AND VAL(3)))
| | Execute on: All Nodes
| | +-- Outer -> JOIN HASH [Semi] [Cost: 2M, Rows: 2B] (PATH ID: 3) Inner (BROADCAST)
| | | Join Cond: (geo.cid = VAL(2))
| | | Execute on: All Nodes
| | | +-- Outer -> STORAGE ACCESS for geo [Cost: 1M, Rows: 3B] (PATH ID: 4)
| | | | Projection: public.geop_b0
| | | | Materialize: geo.cid
| | | | Execute on: All Nodes
| | | +-- Inner -> SELECT [Cost: 10, Rows: 1] (PATH ID: 5)
| | | | Execute on: Query Initiator
| | | | +---> STORAGE ACCESS for dual [Cost: 10, Rows: 1] (PATH ID: 6)
| | | | | Projection: v_catalog.dual_p
| | | | | Execute on: Query Initiator
| | +-- Inner -> SELECT [Cost: 10, Rows: 1] (PATH ID: 7)
| | | Execute on: Query Initiator
| | | +---> STORAGE ACCESS for dual [Cost: 10, Rows: 1] (PATH ID: 8)
| | | | Projection: v_catalog.dual_p
| | | | Execute on: Query Initiator

It looks like it flattened the ORs into a configuration of JOINs, including a BROADCAST step.

 

Let's look at the query plan for an inline case:

Access Path:
+-GROUPBY NOTHING [Cost: 3M, Rows: 1] (PATH ID: 1)
| Aggregates: count(*)
| Execute on: All Nodes
| +---> STORAGE ACCESS for geo [Cost: 3M, Rows: 1B] (PATH ID: 2)
| | Projection: public.geop_b0
| | Materialize: geo.type, geo.cid
| | Filter: ((geo.type = (-1)) OR ((geo.type = 1) AND (geo.cid = 2223)) OR ((geo.type = 0) AND (geo.cid = 1038)))
| | Execute on: All Nodes

Much better!

 

So the question is how to avoid this flattening behavior as it is clearly not a good idea when selective subqueries are used?

 

Any help much appreciated!

 

 

 

Comments

  • Great question, I just found the same issue. Getting an answer would be great.

  • [Deleted User][Deleted User] Administrator

    Do you have "SELECT 2223" as subquery in your real application or it is actually a query that can return set of values. This specific example can be optimized but I guess your real use case could be different.

    I can open enhancement request for this specific example.

  • I think a common use case would be keeping offsets in a table for processing rows. These would often be timestamps or integers. You would think want to query the offset table to build the next where clause.

    Possible feature enhancements:

    • Planner is improved
    • New hint to treat a subquery as a scaler
    • 1 row CTE values are treated as scalers
  • So we managed to get past this with a really ugly hack. Our java code creates a scaler function that the query uses. This has a huge impact on performance. Some of our ETL jobs moved from 3 hours down to 20 minutes after this.

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file