Query performance problem
We see disparities in query execution times which we do not understand. The tests we ran (see below outputs) are all reproducible on our production system. Our production system is running Vertica 6.1.1. celtra=> SELECT VERSION(); VERSION ------------------------------------ Vertica Analytic Database v6.1.1-0 (1 row) We have a base table sessions: CREATE TABLE sessions ( day DATE NOT NULL, creativeId CHAR(8) NOT NULL, sessionId INTEGER NOT NULL, ... PRIMARY KEY (day, creativeId, sessionId) ); and we defined two projections segmented by MODULARHASH: CREATE PROJECTION sessions_super_by_day AS SELECT * FROM sessions ORDER BY day, creativeId, sessionId SEGMENTED BY MODULARHASH(day, creativeId, sessionId) ALL NODES; CREATE PROJECTION sessions_super_by_creativeId AS SELECT * FROM sessions ORDER BY creativeId, day, sessionId SEGMENTED BY MODULARHASH(day, creativeId, sessionId) ALL NODES OFFSET 1; Next, we query both projections (not simultaneously) and measure query execution times. celtra=> SELECT count(*) FROM sessions_super_by_day; count ------------ 1893063056 (1 row) Time: First fetch (1 row): 3745.070 ms. All rows formatted: 3745.177 ms celtra=> SELECT count(*) FROM sessions_super_by_creativeId; count ------------ 1893063056 (1 row) Time: First fetch (1 row): 49087.502 ms. All rows formatted: 49087.583 ms Cardinalities of day and creativeId columns are both small (~100 and ~10000). There are the outputs of corresponding EXPLAIN statements for above queries: EXPLAIN VERBOSE SELECT count(*) FROM sessions_super_by_day; Access Path: LDISTRIB_UNSEGMENTED +-GROUPBY NOTHING [Cost: 1535186.000000, Rows: 1.000000 Disk(B): 0.000000 CPU(B): 15144504448.000000 Memory(B): 32.000000 Netwrk(B): 8.000000 Parallelism: 4.000000] [OutRowSz (B): 8] (PATH ID: 1) | Aggregates: count(*) | Execute on: All Nodes | LDISTRIB_UNSEGMENTED | +---> STORAGE ACCESS for sessions_super_by_day [Cost: 1442750.000000, Rows: 1893063056.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000] [OutRowSz (B): 8] (PATH ID: 2) | | Column Cost Aspects: [ Disk(B): 20609105920.000000 CPU(B): 0.000000 Memory(B): 15144504448.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ] | | Projection: mab.sessions_super_by_day | | Materialize: sessions_super_by_day.day | | Execute on: All Nodes | | Sort Key: (sessions_super_by_day.day, sessions_super_by_day.creativeId, sessions_super_by_day.sessionId) | | LDISTRIB_SEGMENTED EXPLAIN VERBOSE SELECT count(*) FROM sessions_super_by_creativeId; Access Path: LDISTRIB_UNSEGMENTED +-GROUPBY NOTHING [Cost: 370629.000000, Rows: 1.000000 Disk(B): 0.000000 CPU(B): 30289008896.000000 Memory(B): 32.000000 Netwrk(B): 8.000000 Parallelism: 4.000000] [OutRowSz (B): 8] (PATH ID: 1) | Aggregates: count(*) | Execute on: All Nodes | LDISTRIB_UNSEGMENTED | +---> STORAGE ACCESS for sessions_super_by_creativeId [Cost: 185758.000000, Rows: 1893063056.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000] [OutRowSz (B): 16] (PATH ID: 2) | | Column Cost Aspects: [ Disk(B): 14548992.000000 CPU(B): 0.000000 Memory(B): 15144504448.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ] | | Projection: mab.sessions_super_by_creativeId | | Materialize: sessions_super_by_creativeId.creativeId | | Execute on: All Nodes | | Sort Key: (sessions_super_by_creativeId.creativeId, sessions_super_by_creativeId.day, sessions_super_by_creativeId.sessionId) | | LDISTRIB_SEGMENTED Our first question is why is there such a big disparity in query execution times? When we modify our query to include a filter on 'day' column (WHERE condition), which effectively encompasses all data, the query response time on both projections becomes fast. celtra=> SELECT count(*) FROM sessions_super_by_day WHERE day < '2020-01-02'; count ------------ 1893063056 (1 row) Time: First fetch (1 row): 4181.928 ms. All rows formatted: 4182.025 ms celtra=> SELECT count(*) FROM sessions_super_by_creativeId WHERE day < '2020-01-02'; count ------------ 1893063056 (1 row) Time: First fetch (1 row): 4775.328 ms. All rows formatted: 4775.423 ms Again, we run EXPLAIN on above queries. EXPLAIN VERBOSE SELECT count(*) FROM sessions_super_by_day WHERE day < '2020-01-02'; Access Path: LDISTRIB_UNSEGMENTED +-GROUPBY NOTHING [Cost: 1552538.000000, Rows: 1.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 32.000000 Netwrk(B): 8.000000 Parallelism: 4.000000] [OutRowSz (B): 8] (PATH ID: 1) | Aggregates: count(*) | Execute on: All Nodes | LDISTRIB_UNSEGMENTED | +---> STORAGE ACCESS for sessions_super_by_day [Cost: 1552537.000000, Rows: 1893063055.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000] [OutRowSz (B): 0] (PATH ID: 2) | | Column Cost Aspects: [ Disk(B): 20370783669.297089 CPU(B): 20370783669.297089 Memory(B): 15144504439.302990 Netwrk(B): 0.000000 Parallelism: 4.000000 ] | | Projection: mab.sessions_super_by_day | | Filter: (sessions_super_by_day.day < '2020-01-02'::date)/* sel=1.000000 ndv= 77 */ | | Execute on: All Nodes | | Sort Key: (sessions_super_by_day.day, sessions_super_by_day.creativeId, sessions_super_by_day.sessionId) | | LDISTRIB_SEGMENTED EXPLAIN VERBOSE SELECT count(*) FROM sessions_super_by_creativeId WHERE day < '2020-01-02'; Access Path: LDISTRIB_UNSEGMENTED +-GROUPBY NOTHING [Cost: 1603061.000000, Rows: 1.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 32.000000 Netwrk(B): 8.000000 Parallelism: 4.000000] [OutRowSz (B): 8] (PATH ID: 1) | Aggregates: count(*) | Execute on: All Nodes | LDISTRIB_UNSEGMENTED | +---> STORAGE ACCESS for sessions_super_by_creativeId [Cost: 1603060.000000, Rows: 1893063055.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000] [OutRowSz (B): 0] (PATH ID: 2) | | Column Cost Aspects: [ Disk(B): 21123301376.000000 CPU(B): 21123301376.000000 Memory(B): 15144504439.302990 Netwrk(B): 0.000000 Parallelism: 4.000000 ] | | Projection: mab.sessions_super_by_creativeId | | Filter: (sessions_super_by_creativeId.day < '2020-01-02'::date)/* sel=1.000000 ndv= 77 */ | | Execute on: All Nodes | | Sort Key: (sessions_super_by_creativeId.creativeId, sessions_super_by_creativeId.day, sessions_super_by_creativeId.sessionId) | | LDISTRIB_SEGMENTED Our second question is why did inclusion of arbitrary filter (it does not actually filter anything) suddenly speed up the query on the 'slow' projection? We modify our query again, filtering on creativeId column instead of day column. Again, the filter is composed such that it encompasses all data (effectively nothing is filtered out). celtra=> SELECT count(*) FROM sessions_super_by_day WHERE creativeId < 'ffffffff'; count ------------ 1893063056 (1 row) Time: First fetch (1 row): 63830.530 ms. All rows formatted: 63830.608 ms celtra=> SELECT count(*) FROM sessions_super_by_creativeId WHERE creativeId < 'ffffffff'; count ------------ 1893063056 (1 row) Time: First fetch (1 row): 70580.566 ms. All rows formatted: 70580.664 ms Again, the output of EXPLAIN statements: EXPLAIN VERBOSE SELECT count(*) FROM sessions_super_by_day WHERE creativeId < 'ffffffff'; Access Path: LDISTRIB_UNSEGMENTED +-GROUPBY NOTHING [Cost: 186081.000000, Rows: 1.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 32.000000 Netwrk(B): 8.000000 Parallelism: 4.000000] [OutRowSz (B): 8] (PATH ID: 1) | Aggregates: count(*) | Execute on: All Nodes | LDISTRIB_UNSEGMENTED | +---> STORAGE ACCESS for sessions_super_by_day [Cost: 186080.000000, Rows: 1893063055.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000] [OutRowSz (B): 0] (PATH ID: 2) | | Column Cost Aspects: [ Disk(B): 18022400.000000 CPU(B): 18022400.000000 Memory(B): 15144504439.302990 Netwrk(B): 0.000000 Parallelism: 4.000000 ] | | Projection: mab.sessions_super_by_day | | Filter: (sessions_super_by_day.creativeId < 'ffffffff')/* sel=1.000000 ndv= 2885 */ | | Execute on: All Nodes | | Sort Key: (sessions_super_by_day.day, sessions_super_by_day.creativeId, sessions_super_by_day.sessionId) | | LDISTRIB_SEGMENTED EXPLAIN VERBOSE SELECT count(*) FROM sessions_super_by_creativeId WHERE creativeId < 'ffffffff'; Access Path: LDISTRIB_UNSEGMENTED +-GROUPBY NOTHING [Cost: 185916.000000, Rows: 1.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 32.000000 Netwrk(B): 8.000000 Parallelism: 4.000000] [OutRowSz (B): 8] (PATH ID: 1) | Aggregates: count(*) | Execute on: All Nodes | LDISTRIB_UNSEGMENTED | +---> STORAGE ACCESS for sessions_super_by_creativeId [Cost: 185915.000000, Rows: 1893063055.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000] [OutRowSz (B): 0] (PATH ID: 2) | | Column Cost Aspects: [ Disk(B): 15570621.668079 CPU(B): 15570621.668079 Memory(B): 15144504439.302990 Netwrk(B): 0.000000 Parallelism: 4.000000 ] | | Projection: mab.sessions_super_by_creativeId | | Filter: (sessions_super_by_creativeId.creativeId < 'ffffffff')/* sel=1.000000 ndv= 2885 */ | | Execute on: All Nodes | | Sort Key: (sessions_super_by_creativeId.creativeId, sessions_super_by_creativeId.day, sessions_super_by_creativeId.sessionId) | | LDISTRIB_SEGMENTED As can be seen from above, now queries on both projections are slow. What slowed down the queries? We tried adding different permutations of filters, which are not actually limiting any data: - day < '2020-01-01' - creativeId < 'ffffffff' (referred to as "cid" below) - sessionId < 999999999999 (referred to as "sid" below) projection / filter by columns | no filter | day | cid | sid | day + cid | day + sid | cid + sid | day + cid + sid | -------------------------------|--------------|------------------------------------------------------------------------------------------------------- sessions_super_by_day | 3745.070 ms | 4181.928 ms | 63830.530 ms | 4329.763 ms | 6313.164 ms | 5753.121 ms | 6433.096 ms | 7947.076 ms | sessions_super_by_creativeId | 49087.502 ms | 4775.328 ms | 70580.566 ms | 5217.864 ms | 7269.565 ms | 6206.740 ms | 8210.272 ms | 8265.184 ms | We know for a fact that: - Slow queries take >10x more time to execute on the same data - Query plan outputs for the queries that are *slow* show orders of magnitude (> 1000x) *lower* Cost estimate, which is affected by order of magnitude smaller Disk estimates. Your prompt attention to this matter is appreciated. Best regards, Grega
0
Comments