With clause materialization and query planner oddities?
Hi All, We had some queries which based on explain analyze output seemed to be spending much of their time applying to_char formatting to timestamp columns for each row result. To improve the performance, we tried using the new "with clause" introduced in Vertica 6.1.1. (we have just upgraded in order to make use of the feature) From our understanding, the "with clause" statements should be forcing the materialization of those formatted results, thereby avoiding the cost per row. Our initial testing showed that it yielded huge performance improvements. However, with further testing it seems that the query planner doesn't always make use of the materialized (pre-formatted) values. Here is a case where one query behaves correctly, but modifying it slightly seems to make the query planner fall back to applying the function per row. -- EXPECTED BEHAVIOUR (Group by: lt.hourOfDay) -- explain analyse with local_times as (select tstamp,TO_CHAR(new_time(tstamp, 'UTC', 'America/New_York'), 'HH24') as hourOfDay from (select distinct tstamp from delivery where tstamp between '2013-03-01 05:00:00.000' and '2013-04-01 03:59:59.999') as tstamps) select lt.hourOfDay as hourOfDay, sum(delivery.impressions) as impressions from delivery inner join local_times lt on lt.tstamp = delivery.tstamp left outer join geo_countries on delivery.country = geo_countries.code where delivery.tstamp between to_timestamp('2013-03-01 05:00:00.000', 'yyyy-MM-dd hh24:mi:ss.ff3') and to_timestamp('2013-04-01 03:59:59.999', 'yyyy-MM-dd hh24:mi:ss.ff3') group by hourOfDay; Access Path: +-GROUPBY HASH (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 149K, Rows: 10K] (PATH ID: 1) | Aggregates: sum(delivery.impressions) | Group By: lt.hourOfDay | Execute on: All Nodes | +---> JOIN MERGEJOIN(inputs presorted) [Cost: 148K, Rows: 124M (743 RLE)] (PATH ID: 2) Inner (BROADCAST) | | Join Cond: (lt.tstamp = delivery.tstamp) | | Materialize at Input: delivery.tstamp | | Materialize at Output: delivery.impressions | | Execute on: All Nodes | | +-- Outer -> JOIN HASH [LeftOuter] [Cost: 41K, Rows: 124M (421 RLE)] (PATH ID: 3) | | | Join Cond: (delivery.country = geo_countries.code) | | | Execute on: All Nodes | | | Runtime Filter: (SIP1(MergeJoin): delivery.tstamp) | | | +-- Outer -> STORAGE ACCESS for delivery [Cost: 2K, Rows: 124M (421 RLE)] (PATH ID: 4) | | | | Projection: rtb_production.delivery_DBD_1_seg_local_times2_b0 | | | | Materialize: delivery.country | | | | Filter: ((delivery.tstamp >= '2013-03-01 05:00:00'::timestamp) AND (delivery.tstamp <= '2013-04-01 03:59:5 9.999'::timestamp)) | | | | Execute on: All Nodes | | | +-- Inner -> STORAGE ACCESS for geo_countries [Cost: 161, Rows: 255] (PATH ID: 5) | | | | Projection: rtb_production.geo_countries_DBD_5_rep_rtb_rtb_node0001 | | | | Materialize: geo_countries.code | | | | Execute on: All Nodes | | +-- Inner -> SELECT [Cost: 56, Rows: 743] (PATH ID: 6) | | | Execute on: All Nodes | | | Execute on: All Nodes | | | +---> GROUPBY PIPELINED (GLOBAL RESEGMENT GROUPS) [Cost: 56, Rows: 743] (PATH ID: 8) | | | | Group By: delivery.tstamp | | | | Execute on: All Nodes | | | | +---> STORAGE ACCESS for delivery [Cost: 53, Rows: 124M (421 RLE)] (PATH ID: 9) | | | | | Projection: rtb_production.delivery_DBD_1_seg_local_times2_b0 | | | | | Materialize: delivery.tstamp | | | | | Filter: ((delivery.tstamp >= '2013-03-01 05:00:00'::timestamp) AND (delivery.tstamp <= '2013-04-01 03:59 :59.999'::timestamp)) | | | | | Execute on: All Nodes -- UNEXPECTED BEHAVIOUR (Group by: to_char("timezone"('America/New_York', "timezone"('UTC', tstamps.tstamp)), 'HH24')) -- explain analyse with local_times as (select tstamp,TO_CHAR(new_time(tstamp, 'UTC', 'America/New_York'), 'HH24') as hourOfDay from (select distinct tstamp from delivery where tstamp between '2013-03-01 05:00:00.000' and '2013-04-01 03:59:59.999') as tstamps) select lt.hourOfDay as hourOfDay, sum(delivery.impressions) as impressions from delivery inner join local_times lt on lt.tstamp = delivery.tstamp where delivery.tstamp between to_timestamp('2013-03-01 05:00:00.000', 'yyyy-MM-dd hh24:mi:ss.ff3') and to_timestamp('2013-04-01 03:59:59.999', 'yyyy-MM-dd hh24:mi:ss.ff3') group by hourOfDay; Access Path: +-GROUPBY HASH (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 83K, Rows: 2] (PATH ID: 1) | Aggregates: sum(delivery.impressions) | Group By: to_char("timezone"('America/New_York', "timezone"('UTC', tstamps.tstamp)), 'HH24') | Execute on: All Nodes | +---> JOIN MERGEJOIN(inputs presorted) [Cost: 83K, Rows: 124M (421 RLE)] (PATH ID: 2) Inner (BROADCAST) | | Join Cond: (tstamps.tstamp = delivery.tstamp) | | Materialize at Output: delivery.impressions | | Execute on: All Nodes | | +-- Outer -> STORAGE ACCESS for delivery [Cost: 53, Rows: 124M (421 RLE)] (PATH ID: 3) | | | Projection: rtb_production.delivery_DBD_1_seg_local_times2_b0 | | | Materialize: delivery.tstamp | | | Filter: ((delivery.tstamp >= '2013-03-01 05:00:00'::timestamp) AND (delivery.tstamp <= '2013-04-01 03:59:59. 999'::timestamp)) | | | Execute on: All Nodes | | | Runtime Filter: (SIP1(MergeJoin): delivery.tstamp) | | +-- Inner -> SELECT [Cost: 56, Rows: 743] (PATH ID: 4) | | | Execute on: All Nodes | | | +---> GROUPBY PIPELINED (GLOBAL RESEGMENT GROUPS) [Cost: 56, Rows: 743] (PATH ID: 5) | | | | Group By: delivery.tstamp | | | | Execute on: All Nodes | | | | +---> STORAGE ACCESS for delivery [Cost: 53, Rows: 124M (421 RLE)] (PATH ID: 6) | | | | | Projection: rtb_production.delivery_DBD_1_seg_local_times2_b0 | | | | | Materialize: delivery.tstamp | | | | | Filter: ((delivery.tstamp >= '2013-03-01 05:00:00'::timestamp) AND (delivery.tstamp <= '2013-04-01 03:59 :59.999'::timestamp)) | | | | | Execute on: All Nodes Would anyone have any ideas about what's going on here? (we're a bit baffled) Thanks, Emanuel Pordes
0
Comments