Can I get group by pipelined when grouping by date(timestamp) ?
I am trying to optimize the performance of a query like this:
select date(event_time) date,
agent_id,
sum(duration_days) duration
from rpt_fa_agent_activity
where account_id='account1'
and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
group by date(event_time),
agent_id
I created a dedicated projection which is ordered by account_id, agent_id and event_time.
The query plan uses the projection but is doing GROUPBY HASH.
If I remove the date function from the query the plan is with GROUPBY PIPELINED.
Is there a way to get GROUPBY PIPELINED without modifying the query or for a similar query which returns the same results?
select date(event_time) date,
agent_id,
sum(duration_days) duration
from rpt_fa_agent_activity
where account_id='account1'
and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
group by date(event_time),
agent_id
I created a dedicated projection which is ordered by account_id, agent_id and event_time.
The query plan uses the projection but is doing GROUPBY HASH.
If I remove the date function from the query the plan is with GROUPBY PIPELINED.
Is there a way to get GROUPBY PIPELINED without modifying the query or for a similar query which returns the same results?
0
Comments
Have you had a chance to see this document, it may provide some hints/clues
https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_Programmers_Guide.pdf
see the section on
"Avoiding GROUPBY HASH with Projection Design"
If you projection is ordered by account_id, agent_id and event_time, you will not be able to get a GROUPBY PIPELINED using the event_time to group by.
This code will give you a pipelined groupby as it satisfies the projection construction order by.
Like Sumeet pointed out you get good examples on how you can avoid GROUPBY HASH in Vertica online docs.
Thanks for replying.
The problem with my query is applying the date function on a group by colum.
If I remove it I do get GROUPBY PIPELINED.
My question is - is there any way at all to get GROUPBY PIPELINED when applying a function on a group by column?
One thing you can do is to check what is your been done when your query runs by looking at the v_monitor.execution_engine_profiles.
See example:
-- create dummy table --run query with GROUPBY PIPELINED
--run profile --check for for used operators. --run query with GROUPBY HASH
--run profile
--check the operators now I don't know how much sense this will make to you but as you can see more operations are required as (GroupByHash,ExprEval,ExprEval)
This is only because there is an extra row in there it has to resolve.
I believe this example descibes your particular case
create table rpt_fa_agent_activity (event_time timestamp , agent_id int, duration_days int, account_id int );
create projection rpt_fa_agent_activity_p as select * from rpt_fa_agent_activity order by account_id, agent_id, event_time;
insert into rpt_fa_agent_activity values ('2014-01-14 00:00:00', 1 , 12,1);
insert into rpt_fa_agent_activity values ('2014-01-14 00:00:00', 1 , 12,1);
insert into rpt_fa_agent_activity values ('2014-01-14 00:00:00', 1 , 12,1);
insert into rpt_fa_agent_activity values ('2014-05-20 00:00:00', 1 , 22,1);
insert into rpt_fa_agent_activity values ('2014-05-20 00:00:00', 1 , 22,1);
insert into rpt_fa_agent_activity values ('2014-05-20 00:00:00', 1 , 22,1);
-- This uses GroupByHash
select date(event_time) date, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=12 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by date(event_time), agent_id;
-- This uses GroupByPipe
select event_time date, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by event_time, agent_id;
The use of GroupByPipe is predicated on the sort order. The application of a function may change the sort order.
Hence, generally its not correct to use GbyPipe unless its known aprori that the output produced by the function will be sorted the same way as the underlying projection/input.
The date function may take a varchar as a argument and return a date type. If the underlying type is varchar is lexically (alphabetically) sorted, however the date function return a date/int type which is not be sorted anymore. Hence as a generaly rule GbyHash is the correct operator when a function is specificed in a gby clause.
lexically sorted
Dec 2012
Jan 2012
Jan 2013
sorted as date
Jan 2012
Dec 2012
Jan 2013
If the reason you are modifying the query is for the presentation of the data field, i believe there may be a way
skeswani=> select event_time date, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by event_time, agent_id;
date | agent_id | duration
---------------------+----------+----------
2014-01-14 00:00:00 | 1 | 36
2014-05-20 00:00:00 | 1 | 66
(2 rows)
skeswani=> select date(event_time), agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by date(event_time), agent_id;
date | agent_id | duration
------------+----------+----------
2014-05-20 | 1 | 66
2014-01-14 | 1 | 36
(2 rows)
-- This may work for you
skeswani=> \o | grep Pipe
skeswani=> explain select date(event_time), agent_id, duration from ( select event_time, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by 1, 2 ) as A;
skeswani=> 4[label = "GroupByPipe: 2 keys\nAggs:\n sum(rpt_fa_agent_activity.duration_days)\nUnc: Integer(8)\nUnc: Timestamp(8)\nUnc: Integer(8)", color = "green", shape = "box"];
6[label = "GroupByPipe: 2 keys\nAggs:\n sum(rpt_fa_agent_activity.duration_days)\nUnc: Integer(8)\nUnc: Timestamp(8)\nUnc: Integer(8)", color = "brown", shape = "box"];
skeswani=> \o
skeswani=> select date(event_time), agent_id, duration from ( select event_time, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by 1, 2 ) as A;
date | agent_id | duration
------------+----------+----------
2014-01-14 | 1 | 36
2014-05-20 | 1 | 66
(2 rows)
-- alternatively you may consider using a view to achive the above effective subquery.
It attempts to work around the issue.
Also, GbyHash is the correct operator if the grouping is done by a function, due to the reason i mentioned above. I suspect the user believes that GbyPipe offers some performance benefit and want to coerce the query to use that, and this would enable them to do so.
[ sorry my work around is not really correct ]
if the event_time is a timestamp (i.e. has a higher resolution/granularity than date); then the groups wont be identical.
I guess the GbyHash is required for this exact reason, unless event_time is a varchar and only contains the date.
--Sharon
Very soon will be avalable !
The event_time column type is timestamptz and we execute the "set timezone" statement before executing the query.
@sumeet_1
I believe Vertica needs to add the concept of increasing/decreasing monotonic functions flag (it already know the immutable concept) in order to hint the query optimizer if a function may or not change the ordering of the underlaying columns.
For example, taking the column as-is without applying a function, is equivalent to the identity function which is monotonic increasing. In this particular case, applying the DATE function to a TIMESTAMP column is a monotonic (as is DATE_TRUNC).
This would be an improvement since I suspect the first column of fact tables is almost always a TIMESTAMP (when timezone are involved).