Optimizing performance of a query with row_number
I am trying to optimize the performance of a query.
Here is an example of 2 queries that demonstrate the issue we have.
Both queries return the same results and have the same explain plan but the execution time is very different.
Query 1 - execution time 0.3 sec
with test as
(
SELECT
case when ROW_NUMBER() OVER (PARTITION BY PURCHASE_ID ORDER BY ACTION_TIME ASC, GOAL_MATCH_IND DESC NULLS LAST)=1 then value end as REVENUE
FROM B_DEFAULT.RPT_FA_CONTRIBUTION F
where account_id = '123'
and purchase_id is not null
)
select sum(REVENUE)
from test
explain:
Access Path:
+-GROUPBY NOTHING [Cost: 89K, Rows: 1] (PATH ID: 1)
| Aggregates: sum(test.REVENUE)
| Execute on: All Nodes
| Execute on: All Nodes
| +---> ANALYTICAL [Cost: 89K, Rows: 2M] (PATH ID: 3)
| | Analytic Group
| | Functions: row_number()
| | Group Global & Local Resegment: F.PURCHASE_ID
| | Group Sort: F.PURCHASE_ID ASC, F.ACTION_TIME ASC NULLS LAST, F.GOAL_MATCH_IND DESC NULLS LAST
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for F [Cost: 7K, Rows: 2M] (PATH ID: 4)
| | | Projection: B_DEFAULT.RPT_FA_CONTRIBUTION_CROSS_SESSION_b0
| | | Materialize: F.PURCHASE_ID, F.VALUE, F.ACTION_TIME, F.GOAL_MATCH_IND
| | | Filter: (F.ACCOUNT_ID = '123')
| | | Filter: (F.PURCHASE_ID IS NOT NULL)
| | | Execute on: All Nodes
Query 2 - execution time 3.3 sec
(removed the purchase_id is not null filter)
(removed the purchase_id is not null filter)
with test as
(
SELECT
case when ROW_NUMBER() OVER (PARTITION BY PURCHASE_ID ORDER BY ACTION_TIME ASC, GOAL_MATCH_IND DESC NULLS LAST)=1 then value end as REVENUE
FROM B_DEFAULT.RPT_FA_CONTRIBUTION F
where account_id = '123'
)
select sum(REVENUE)
from test
explain -
Access Path:
+-GROUPBY NOTHING [Cost: 89K, Rows: 1] (PATH ID: 1)
| Aggregates: sum(test.REVENUE)
| Execute on: All Nodes
| Execute on: All Nodes
| +---> ANALYTICAL [Cost: 89K, Rows: 2M] (PATH ID: 3)
| | Analytic Group
| | Functions: row_number()
| | Group Global & Local Resegment: F.PURCHASE_ID
| | Group Sort: F.PURCHASE_ID ASC, F.ACTION_TIME ASC NULLS LAST, F.GOAL_MATCH_IND DESC NULLS LAST
| | Execute on: All Nodes
| | +---> STORAGE ACCESS for F [Cost: 7K, Rows: 2M] (PATH ID: 4)
| | | Projection: B_DEFAULT.RPT_FA_CONTRIBUTION_CROSS_SESSION_b0
| | | Materialize: F.VALUE, F.ACTION_TIME, F.GOAL_MATCH_IND, F.PURCHASE_ID
| | | Filter: (F.ACCOUNT_ID = '123')
| | | Execute on: All Nodes
* The reason that both queries return the same result is that the value column is null where purchase_id is also null.
Though it does not look like it from the explain plans , I believe that query 1 is faster because the analytical row_number function is running on a smaller number of rows.
I am trying to understand how to re-write query 2 to improve it's performance but without adding the filter on the where clause.
I tried adding this filter to the case, like:
case when purchase_id is not null and ROW_NUMBER()....
But this does not help.
We would love to hear your ideas
0
Comments
I checked with a technical expert and his reply was:
The first query will process fewer number of rows because of the additional filter, unless the purchase_id column does not contain NULL.
I hope that helps.
Sarah
Hi Sarah,
That's true, but I am looking for a way to filter out the nulls without adding the condition ( ... is not null) to the where clause.
The reason for this is that I use the inner query as part of a view and some calculated columns in the view do require going over the null values.
Thank you for your help.
I'm not clear on whether you want to ignore the NULL values or not.
Can you file a support case with Vertica? I think this requires more details.
Sorry that I'm not more help,
Sarah