Can I get GROUPBY PIPELINED when grouping by date(timestamp)?

I am trying to optimize the performance of a query like this:

select date(event_time) date,
       agent_id,
       sum(duration_days) duration
from rpt_fa_agent_activity
where account_id='account1'
  and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
group by date(event_time),
         agent_id;

I created a dedicated projection which is ordered by account_id, agent_id and event_time.

The query plan uses the projection but does a GROUPBY HASH.
If I remove the date function from the query, the plan uses GROUPBY PIPELINED.

Is there a way to get GROUPBY PIPELINED without modifying the query, or with a similar query that returns the same results?

Comments

  • It's hard to make specific suggestions without knowing the projections and column types.

    Have you had a chance to see this document? It may provide some hints/clues:
    https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_Programmers_Guide.pdf

    See the section
    "Avoiding GROUPBY HASH with Projection Design"
  • Dahna,
    If your projection is ordered by account_id, agent_id, and event_time, you will not be able to get GROUPBY PIPELINED when grouping by date(event_time).
    select date(event_time) date,
           agent_id,
           sum(duration_days) duration
    from rpt_fa_agent_activity3
    where account_id=1
      and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
    group by account_id, agent_id, event_time;
    This query will give you a pipelined GROUP BY because it satisfies the projection's ORDER BY.
    As Sumeet pointed out, the Vertica online docs have good examples of how to avoid GROUPBY HASH.
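    The pipelined strategy this comment relies on is easy to sketch. Below is a minimal Python stand-in (the rows and the itertools-based aggregator are illustrative, not Vertica internals): because the input already arrives in the projection's ORDER BY, each group can be aggregated and emitted as soon as its key changes, with no hash table.

    ```python
    from itertools import groupby

    # Hypothetical in-memory stand-in for the projection's sorted data:
    # (account_id, agent_id, event_time, duration_days), already ordered
    # by account_id, agent_id, event_time.
    rows = [
        (1, 1, "2014-01-14 00:00:00", 12),
        (1, 1, "2014-01-14 00:00:00", 12),
        (1, 1, "2014-05-20 00:00:00", 22),
        (1, 2, "2014-03-01 00:00:00", 5),
    ]

    def groupby_pipelined(sorted_rows):
        """Stream over sorted input and emit each group as soon as its
        key changes -- the idea behind Vertica's GROUPBY PIPELINED.
        Only one group is held in memory at a time."""
        for key, grp in groupby(sorted_rows, key=lambda r: (r[0], r[1], r[2])):
            yield key, sum(r[3] for r in grp)

    result = list(groupby_pipelined(rows))
    ```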
  • Hi Sumeet and Adrian,
    Thanks for replying.

    The problem with my query is applying the date function to a GROUP BY column.
    If I remove it I do get GROUPBY PIPELINED.

    My question is: is there any way at all to get GROUPBY PIPELINED when applying a function to a GROUP BY column?

  • I don't think it's possible, as the function output is a completely different set of data (not taken into account by the initial ORDER BY when the table/projection was created), and you cannot order a projection of a table by a function(col_name).
    One thing you can do is check what is actually being done when your query runs by looking at v_monitor.execution_engine_profiles.
    See example:
    -- create dummy table
    create table one (d1 date) order by d1;
    insert into one values (getdate());
    insert into one values (getdate());
    insert into one values (getdate());
    insert into one values (getdate());
    insert into one values (getdate());
    commit;
    --run query with GROUPBY PIPELINED
    --run profile 
    dbadmin=> profile select d1 as date from one group by d1;
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273706907 and statement_id=28;
    NOTICE 3557:  Initiator memory for query: [on pool general: 5824 KB, minimum: 5824 KB]
    NOTICE 5077:  Total memory required by query: [5824 KB]
        date
    ------------
     2014-05-30
    (1 row)
    --check the operators used.
    Time: First fetch (1 row): 10.696 ms. All rows formatted: 10.735 ms
    dbadmin=> select distinct operator_name from v_monitor.execution_engine_profiles where transaction_id=45035996273706907 and statement_id=28;
     operator_name
    ---------------
     StorageUnion
     Scan
     GroupByPipe
     Root
     NewEENode
    (5 rows)
    --run query with GROUPBY HASH
    --run profile 
     
    dbadmin=> profile select d1 as date from one group by d1, date(d1);
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273706907 and statement_id=31;
    NOTICE 3557:  Initiator memory for query: [on pool general: 122353 KB, minimum: 122353 KB]
    NOTICE 5077:  Total memory required by query: [122353 KB]
        date
    ------------
     2014-05-30
    (1 row)
    --check the operators now 
    Time: First fetch (1 row): 21.453 ms. All rows formatted: 21.500 ms
    dbadmin=> select distinct operator_name from v_monitor.execution_engine_profiles where transaction_id=45035996273706907 and statement_id=31;
     operator_name
    ---------------
     GroupByHash
     ParallelUnion
     StorageUnion
     Scan
     ExprEval
     GroupByPipe
     Root
     NewEENode
    (8 rows)
    I don't know how much sense this will make to you, but as you can see, more operators are required (GroupByHash, ParallelUnion, ExprEval).
    This is only because there is an extra grouping expression it has to resolve.
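    For contrast, a hash-based group-by must keep every distinct key in memory until the whole input is consumed, which matches the much larger memory reservation shown in the GroupByHash profile above. A minimal Python sketch (illustrative only, not Vertica internals):

    ```python
    from collections import defaultdict

    def groupby_hash(rows, key, value):
        """Build a hash table keyed by the group-by expression. Every
        distinct key stays resident until the input is exhausted, which
        is why GroupByHash reserves far more memory than GroupByPipe."""
        acc = defaultdict(int)
        for r in rows:
            acc[key(r)] += value(r)
        return dict(acc)

    # The engine cannot assume the key stream date(d1) is sorted,
    # so it must hash even for this tiny input.
    rows = [("2014-05-30", 1), ("2014-05-30", 1), ("2014-05-30", 1)]
    totals = groupby_hash(rows, key=lambda r: r[0], value=lambda r: r[1])
    ```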


  • I believe this example describes your particular case:
    create table rpt_fa_agent_activity (event_time timestamp , agent_id int,  duration_days int,  account_id int );
    create projection rpt_fa_agent_activity_p as select *  from rpt_fa_agent_activity order by account_id, agent_id, event_time;

    insert into rpt_fa_agent_activity values ('2014-01-14 00:00:00', 1 , 12,1);
    insert into rpt_fa_agent_activity values ('2014-01-14 00:00:00', 1 , 12,1);
    insert into rpt_fa_agent_activity values ('2014-01-14 00:00:00', 1 , 12,1);
    insert into rpt_fa_agent_activity values ('2014-05-20 00:00:00', 1 , 22,1);
    insert into rpt_fa_agent_activity values ('2014-05-20 00:00:00', 1 , 22,1);
    insert into rpt_fa_agent_activity values ('2014-05-20 00:00:00', 1 , 22,1);

    -- This uses GroupByHash
    select date(event_time) date, agent_id, sum(duration_days) duration
    from rpt_fa_agent_activity
    where account_id=1
      and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
    group by date(event_time), agent_id;

    -- This uses GroupByPipe
    select event_time date, agent_id, sum(duration_days) duration
    from rpt_fa_agent_activity
    where account_id=1
      and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
    group by event_time, agent_id;



    The use of GroupByPipe is predicated on the sort order, and applying a function may change the sort order.
    Hence, in general it is not correct to use GroupByPipe unless it is known a priori that the output produced by the function is sorted the same way as the underlying projection/input.

    For example, the date function may take a varchar as an argument and return a date type. If the underlying type is varchar, it is lexically (alphabetically) sorted; the date function, however, returns a date type whose values are no longer in sorted order. Hence, as a general rule, GroupByHash is the correct operator when a function is specified in a GROUP BY clause.

    lexically sorted
    Dec 2012
    Jan 2012
    Jan 2013

    sorted as date
    Jan 2012
    Dec 2012
    Jan 2013



    If the reason you are modifying the query is the presentation of the date field, I believe there may be a way:

    skeswani=> select event_time date, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by event_time, agent_id;
            date         | agent_id | duration 
    ---------------------+----------+----------
     2014-01-14 00:00:00 |        1 |       36
     2014-05-20 00:00:00 |        1 |       66
    (2 rows)


    skeswani=> select date(event_time), agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by date(event_time), agent_id;
        date    | agent_id | duration 
    ------------+----------+----------
     2014-05-20 |        1 |       66
     2014-01-14 |        1 |       36
    (2 rows)



    -- This may work for you

    skeswani=> \o | grep Pipe
    skeswani=> explain select date(event_time), agent_id, duration  from ( select event_time, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by 1, 2 ) as A;
    skeswani=>  4[label = "GroupByPipe: 2 keys\nAggs:\n  sum(rpt_fa_agent_activity.duration_days)\nUnc: Integer(8)\nUnc: Timestamp(8)\nUnc: Integer(8)", color = "green", shape = "box"];
                    6[label = "GroupByPipe: 2 keys\nAggs:\n  sum(rpt_fa_agent_activity.duration_days)\nUnc: Integer(8)\nUnc: Timestamp(8)\nUnc: Integer(8)", color = "brown", shape = "box"];
    skeswani=> \o
    skeswani=> select date(event_time), agent_id, duration  from ( select event_time, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by 1, 2 ) as A;
        date    | agent_id | duration 
    ------------+----------+----------
     2014-01-14 |        1 |       36
     2014-05-20 |        1 |       66
    (2 rows)


    -- Alternatively, you may consider using a view to achieve the effect of the subquery above.
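    The lexical-vs-chronological point above can be checked in a few lines of Python (the month strings are from the example; `datetime.strptime` stands in for Vertica's date cast):

    ```python
    from datetime import datetime

    # "Mon YYYY" strings: a varchar projection column would store them
    # in lexical order, which is not chronological order.
    vals = ["Jan 2012", "Dec 2012", "Jan 2013"]

    lexical = sorted(vals)  # how a varchar column is sorted
    chronological = sorted(vals, key=lambda s: datetime.strptime(s, "%b %Y"))

    # The two orders differ, so applying a date cast to a sorted varchar
    # column yields an unsorted key stream -- the engine must hash.
    ```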
  • I loved your example, but I don't see where you are grouping by date(xxx).
  • I don't use "group by date(XXX)", probably because I don't believe you need to group by the output of the function. In this specific case, you can group by the original sorted data; the grouping will be identical. You can then apply the function to the output.

    It attempts to work around the issue.

    Also, GbyHash is the correct operator if the grouping is done by a function, for the reason I mentioned above. I suspect the user believes that GbyPipe offers some performance benefit and wants to coerce the query to use it, and this would enable them to do so.
  • Yes, you are correct; I was not seeing the GROUP BY. Nice approach.
  • On second thought, a group by date(XXX) will be required (thanks for pointing it out).
    [Sorry, my workaround is not really correct.]

    If event_time is a timestamp (i.e. has a higher resolution/granularity than date), then the groups won't be identical.

    I guess GbyHash is required for this exact reason, unless event_time is a varchar and contains only the date.
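    The resolution problem is easy to demonstrate. In this Python sketch (hypothetical timestamps, string slicing standing in for the date cast), grouping by the full timestamp yields three groups while grouping by the date part yields one, so the two groupings are not equivalent:

    ```python
    from itertools import groupby

    # Three events on the same day but at different times.
    ts = ["2014-01-14 08:00:00", "2014-01-14 12:30:00", "2014-01-14 23:59:59"]

    # Group by the full timestamp: every value is distinct -> 3 groups.
    by_timestamp = [k for k, _ in groupby(ts)]

    # Group by the date part only (first 10 chars) -> 1 group.
    by_date = [k for k, _ in groupby(ts, key=lambda t: t[:10])]
    ```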



  • Thank you both.
  • It's best to pull the date out into a separate column during the load process. Queries and aggregations on date are very common. Then you can easily optimize for GROUPBY PIPELINED and have more efficient query predicates.

      --Sharon


  • In 7.1, Vertica provides expression-based projections, which will do what you need (like function-based indexes in Oracle).

    It will be available very soon!
  • We can't add a date column to the table because we need to query by different time zones.
    The event_time column type is timestamptz, and we execute a "set timezone" statement before executing the query.

  • @sumeet_1

     

    I believe Vertica needs to add the concept of an increasing/decreasing monotonic function flag (it already knows the immutable concept) in order to hint the query optimizer whether a function may change the ordering of the underlying columns.

     

    For example, taking the column as-is, without applying a function, is equivalent to the identity function, which is monotonic increasing. In this particular case, applying the DATE function to a TIMESTAMP column is monotonic (as is DATE_TRUNC).

     

    This would be an improvement, since I suspect the first sort column of fact tables is almost always a TIMESTAMP (when timezones are involved).
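    The monotonicity property this comment proposes can be sketched in Python: if a function maps sorted input to sorted output, a pipelined group-by could in principle reuse the projection's sort order. The helper below is a hypothetical illustration, not a Vertica feature:

    ```python
    from datetime import datetime

    def is_monotonic_on(f, sorted_inputs):
        """Check that f preserves (non-strict) ordering on a sorted
        sample. If it does, grouping by f(col) could stream over the
        projection's sort order instead of hashing."""
        outputs = [f(x) for x in sorted_inputs]
        return all(a <= b for a, b in zip(outputs, outputs[1:]))

    # Sorted timestamps spanning two days.
    stamps = [
        datetime(2014, 1, 14, 0),
        datetime(2014, 1, 14, 6),
        datetime(2014, 1, 14, 23),
        datetime(2014, 5, 20, 12),
    ]

    # DATE preserves the order of sorted timestamps...
    date_is_monotonic = is_monotonic_on(lambda t: t.date(), stamps)
    # ...whereas e.g. extracting the hour does not.
    hour_is_monotonic = is_monotonic_on(lambda t: t.hour, stamps)
    ```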
