analytical function degrading the performance
I have a query which takes more than 3 hours .. the main fact table is written in a 3 level inner query.
In one sub query we use lag function with more than 20 columns as partition clause and same columns in order by of partiotion.
This is the bottle neck I see in explainplan as well as I query by pieces and stuck here.
How to deal with this scenario. Can a projection with the columns in analytical partition help me?
0
Comments
EXPLAIN PLAN FOR LAG:
Access Path:
LDISTRIB_UNSEGMENTED
+-ANALYTICAL [Cost: 8192908.000000, Rows: 6719672.000000 Disk(B): 35627700944.000000 CPU(B): 1321764404451.936768 Memory(B): 112574665016.000000 Netwrk(B): 76946964072.000000 Parallelism: 12.000000 (NO STATISTICS)] [OutRowSz (B): 11451] (PATH ID: 1)
| Analytic Group
| Functions: lag()
| Group Global & Local Resegment: upper(IFF.SNN), upper(IFF.PNN), upper(IFF.CTT), upper(IFF.CNTSS), upper(IFF.CHRM), upper(IFF.feed), upper(IFF.DOC), upper(IFF.IPBIN), upper(IFF.OPBIN), upper(IFF.CONFAC), upper(IFF.MEDSIZ), upper(IFF.JOB), upper(IFF.PLEX), upper(IFF.LTR), upper(IFF.unit), upper(IFF.IMG), upper(IFF.PRQULI), upper(IFF.media_key), upper(IFF."position"), upper(IFF.SEL), upper(IFF.INK), upper(IFF.CATR)
| Group Sort: upper(IFF.SNN) ASC, upper(IFF.PNN) ASC, upper(IFF.CTT) ASC, upper(IFF.CNTSS) ASC, upper(IFF.CHRM) ASC, upper(IFF.feed) ASC, upper(IFF.DOC) ASC, upper(IFF.IPBIN) ASC, upper(IFF.OPBIN) ASC, upper(IFF.CONFAC) ASC, upper(IFF.MEDSIZ) ASC, upper(IFF.JOB) ASC, upper(IFF.PLEX) ASC, upper(IFF.LTR) ASC, upper(IFF.unit) ASC, upper(IFF.IMG) ASC, upper(IFF.PRQULI) ASC, upper(IFF.media_key) ASC, upper(IFF."position") ASC, upper(IFF.SEL) ASC, upper(IFF.INK) ASC, upper(IFF.CATR) ASC, upper(IFF.SNN) ASC NULLS LAST, upper(IFF.PNN) ASC NULLS LAST, upper(IFF.CTT) ASC NULLS LAST, upper(IFF.CNTSS) ASC NULLS LAST, upper(IFF.CHRM) ASC NULLS LAST, upper(IFF.feed) ASC NULLS LAST, upper(IFF.DOC) ASC NULLS LAST, upper(IFF.IPBIN) ASC NULLS LAST, upper(IFF.OPBIN) ASC NULLS LAST, upper(IFF.CONFAC) ASC NULLS LAST, upper(IFF.MEDSIZ) ASC NULLS LAST, upper(IFF.JOB) ASC NULLS LAST, upper(IFF.PLEX) ASC NULLS LAST, upper(IFF.LTR) ASC NULLS LAST, upper(IFF.unit) ASC NULLS LAST, upper(IFF.IMG) ASC NULLS LAST, upper(IFF.PRQULI) ASC NULLS LAST, upper(IFF.media_key) ASC NULLS LAST, upper(IFF."position") ASC NULLS LAST, upper(IFF.SEL) ASC NULLS LAST, upper(IFF.INK) ASC NULLS LAST, upper(IFF.CATR) ASC NULLS LAST, IFF.capture_dtm ASC NULLS LAST, IFF.value_float DESC NULLS FIRST
| Execute on: All Nodes
| LDISTRIB_SEGMENTED
| Execute on: All Nodes
| LDISTRIB_UNSEGMENTED
| +---> UNION ALL [Cost: 46025.000000, Rows: 6719672.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 2651] (PATH ID: 3)
| | Execute on: All Nodes
| | LDISTRIB_UNSEGMENTED
Hi ,
Analytic functions in some cases can be very slow comparing to regular aggregation functions , as many times Analytic functions run’s a single threads and for large data set is the main bottleneck .
Projections can improved the sorting part , but will not change the single threads executions . As you do not attach the query its hard to advice how to advice you the write the query
I hope you will find it helpful
Thanks
SELECT LAG (value_f ,1) OVER (
PARTITION BY UPPER (sn),
UPPER (pn),
UPPER (ct),
UPPER (csc),
UPPER (chrm),
UPPER (feed),
UPPER (df),
UPPER (ipbin),
UPPER (opbin),
UPPER (cf),
UPPER (msn),
UPPER (jb),
UPPER (pr),
UPPER (lse),
UPPER (unit),
UPPER (image_type),
UPPER (pq),
UPPER (mk),
UPPER (position),
UPPER (sen),
UPPER (iu),
UPPER (cpn)
ORDER BY
UPPER (sn),
UPPER (pn),
UPPER (ct),
UPPER (csc),
UPPER (chrm),
UPPER (feed),
UPPER (df),
UPPER (ipbin),
UPPER (opbin),
UPPER (cf),
UPPER (msn),
UPPER (jb),
UPPER (pr),
UPPER (lse),
UPPER (unit),
UPPER (image_type),
UPPER (pq),
UPPER (mk),
UPPER (position),
UPPER (sen),
UPPER (iu),
UPPER (cpn),
capture_d ASC,
value_f DESC)pr_f ,gi_fact.*
FROM (SELECT gi.file_name,
dev.bornd,
gi.ticket_type,
gi.gi_flag,
gi.ca_status,
gi.fqi,
dev.sn,
dev.pn,
gi.ct,
gi.csc,
gi.chromatic_mode,
gi.feed,
gi.df,
gi.ipbin,
gi.opbin,
gi.cf,
gi.msn,
gi.jb,
gi.pr,
gi.lse,
gi.unit,
gi.image_type,
gi.pq,
gi.mk,
gi."position",
gi.sen,
TRIM(gi.iu) iu,
gi.cpn,
gi.capture_d ,
gi.value_f
from STG_DEV_IMP gi, STG_DEV_INFO dev
WHERE gi.fn = dev.fn
UNION ALL
SELECT null fn,
bornd,
ticket_type_desc as ticket_type,
gi_flag,
ca_status,
fqi,
sn,
pn,
ct,
csc,
chrm,
feed,
df,
ipbin,
opbin,
cf,
msn,
jb,
pr,
lse,
unit,
image_type,
pq,
mk,
"position",
sen,
TRIM(iu) iu,
catr AS cpn,
capture_d,
GLB_im as value_float
from FACT_GL_IMP WHERE sn||pn IN (SELECT DISTINCT im.sn || im.pn FROM STG_DEV_INFO im) )gi_fact
FACT_GL_IMP table is huge Fact table with morethan 1.5 TB of storage space.
How does the grouped clause for creating a query specific projection like below help me.
CREATE PROJECTION perf_tem_per2_grp /*+createtype(L)*/
(grouped(na_id,name))
AS
SELECT perf_tem.name,perf_tem.na_id
FROM PPS_DEVICE_QARI.perf_tem
ORDER BY perf_tem.na_id,perf_tem.name
SEGMENTED BY hash(perf_tem.na_id, perf_tem.name) ALL NODES KSAFE 1;
If you execute this often and want to optimize for it with a projection, you need to consider a couple of things:
- You are using UPPER() on a number of columns. You should "fix" your data model so that the data is normalized on the way in so that UPPER() isn't needed. Such expressions become expensive if they are always needed on billions of rows. Another option is an expression projection, but you should really start with a clean data model.
- Projections can only optimize data sorted in ascending order. One of your columns is sorted descending. If it's possible to rewrite the query so that all data is sorted ascending, and use LEAD() instead of LAG(), then you could optimize using a projection.
--Sharon
How does expression projection work?
If I create a projection with Upper of columns and use the query as it is will the expression projection be picked?
See the documentation:
https://my.vertica.com/docs/7.2.x/HTML/index.htm#Authoring/AnalyzingData/AggregatedData/ExpressionsAggregateData.htm