Can I get GROUPBY PIPELINED when grouping by date(timestamp)?

I am trying to optimize the performance of a query like this:

select date(event_time) date,
       agent_id,
       sum(duration_days) duration
from rpt_fa_agent_activity
where account_id='account1'
  and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
group by date(event_time),
         agent_id;

I created a dedicated projection which is ordered by account_id, agent_id and event_time.

The query plan uses the projection but does a GROUPBY HASH.
If I remove the date function from the query, the plan uses GROUPBY PIPELINED.

Is there a way to get GROUPBY PIPELINED without modifying the query, or with a similar query that returns the same results?

Comments

  • It's hard to make specific suggestions without knowing the projections and column types.

    Have you had a chance to see this document? It may provide some hints/clues:
    https://my.vertica.com/docs/7.0.x/PDF/HP_Vertica_7.0.x_Programmers_Guide.pdf

    See the section
    "Avoiding GROUPBY HASH with Projection Design"
  • Dahna,
    If your projection is ordered by account_id, agent_id, and event_time, you will not be able to get GROUPBY PIPELINED when grouping by date(event_time).
    select date(event_time) date,
           agent_id,
           sum(duration_days) duration
    from rpt_fa_agent_activity3
    where account_id=1
      and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
    group by account_id, agent_id, event_time;
    This query will give you a pipelined GROUP BY because it satisfies the projection's ORDER BY.
    As Sumeet pointed out, the Vertica online docs have good examples of how to avoid GROUPBY HASH.
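    The pipelined strategy this comment relies on is easy to sketch. Below is a minimal Python stand-in (the rows and the itertools-based aggregator are illustrative, not Vertica internals): because the input already arrives in the projection's ORDER BY, each group can be aggregated and emitted as soon as its key changes, with no hash table.

    ```python
    from itertools import groupby

    # Hypothetical in-memory stand-in for the projection's sorted data:
    # (account_id, agent_id, event_time, duration_days), already ordered
    # by account_id, agent_id, event_time.
    rows = [
        (1, 1, "2014-01-14 00:00:00", 12),
        (1, 1, "2014-01-14 00:00:00", 12),
        (1, 1, "2014-05-20 00:00:00", 22),
        (1, 2, "2014-03-01 00:00:00", 5),
    ]

    def groupby_pipelined(sorted_rows):
        """Stream over sorted input and emit each group as soon as its
        key changes -- the idea behind Vertica's GROUPBY PIPELINED.
        Only one group is held in memory at a time."""
        for key, grp in groupby(sorted_rows, key=lambda r: (r[0], r[1], r[2])):
            yield key, sum(r[3] for r in grp)

    result = list(groupby_pipelined(rows))
    ```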
  • Hi Sumeet and Adrian,
    Thanks for replying.

    The problem with my query is applying the date function to a GROUP BY column.
    If I remove it I do get GROUPBY PIPELINED.

    My question is: is there any way at all to get GROUPBY PIPELINED when applying a function to a GROUP BY column?

  • I don't think it's possible, as the function output is a completely different set of data (not taken into account by the initial ORDER BY when the table/projection was created), and you cannot order a projection of a table by a function(col_name).
    One thing you can do is check what is actually being done when your query runs by looking at v_monitor.execution_engine_profiles.
    See example:
    -- create dummy table
    create table one (d1 date) order by d1;
    insert into one values (getdate());
    insert into one values (getdate());
    insert into one values (getdate());
    insert into one values (getdate());
    insert into one values (getdate());
    commit;
    --run query with GROUPBY PIPELINED
    --run profile 
    dbadmin=> profile select d1 as date from one group by d1;
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273706907 and statement_id=28;
    NOTICE 3557:  Initiator memory for query: [on pool general: 5824 KB, minimum: 5824 KB]
    NOTICE 5077:  Total memory required by query: [5824 KB]
        date
    ------------
     2014-05-30
    (1 row)
    --check the operators used.
    Time: First fetch (1 row): 10.696 ms. All rows formatted: 10.735 ms
    dbadmin=> select distinct operator_name from v_monitor.execution_engine_profiles where transaction_id=45035996273706907 and statement_id=28;
     operator_name
    ---------------
     StorageUnion
     Scan
     GroupByPipe
     Root
     NewEENode
    (5 rows)
    --run query with GROUPBY HASH
    --run profile 
     
    dbadmin=> profile select d1 as date from one group by d1, date(d1);
    NOTICE 4788:  Statement is being profiled
    HINT:  Select * from v_monitor.execution_engine_profiles where transaction_id=45035996273706907 and statement_id=31;
    NOTICE 3557:  Initiator memory for query: [on pool general: 122353 KB, minimum: 122353 KB]
    NOTICE 5077:  Total memory required by query: [122353 KB]
        date
    ------------
     2014-05-30
    (1 row)
    --check the operators now 
    Time: First fetch (1 row): 21.453 ms. All rows formatted: 21.500 ms
    dbadmin=> select distinct operator_name from v_monitor.execution_engine_profiles where transaction_id=45035996273706907 and statement_id=31;
     operator_name
    ---------------
     GroupByHash
     ParallelUnion
     StorageUnion
     Scan
     ExprEval
     GroupByPipe
     Root
     NewEENode
    (8 rows)
    I don't know how much sense this will make to you, but as you can see, more operators are required (GroupByHash, ParallelUnion, ExprEval).
    This is only because there is an extra grouping expression it has to resolve.
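    For contrast, a hash-based group-by must keep every distinct key in memory until the whole input is consumed, which matches the much larger memory reservation shown in the GroupByHash profile above. A minimal Python sketch (illustrative only, not Vertica internals):

    ```python
    from collections import defaultdict

    def groupby_hash(rows, key, value):
        """Build a hash table keyed by the group-by expression. Every
        distinct key stays resident until the input is exhausted, which
        is why GroupByHash reserves far more memory than GroupByPipe."""
        acc = defaultdict(int)
        for r in rows:
            acc[key(r)] += value(r)
        return dict(acc)

    # The engine cannot assume the key stream date(d1) is sorted,
    # so it must hash even for this tiny input.
    rows = [("2014-05-30", 1), ("2014-05-30", 1), ("2014-05-30", 1)]
    totals = groupby_hash(rows, key=lambda r: r[0], value=lambda r: r[1])
    ```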


  • I believe this example describes your particular case:
    create table rpt_fa_agent_activity (event_time timestamp , agent_id int,  duration_days int,  account_id int );
    create projection rpt_fa_agent_activity_p as select *  from rpt_fa_agent_activity order by account_id, agent_id, event_time;

    insert into rpt_fa_agent_activity values ('2014-01-14 00:00:00', 1 , 12,1);
    insert into rpt_fa_agent_activity values ('2014-01-14 00:00:00', 1 , 12,1);
    insert into rpt_fa_agent_activity values ('2014-01-14 00:00:00', 1 , 12,1);
    insert into rpt_fa_agent_activity values ('2014-05-20 00:00:00', 1 , 22,1);
    insert into rpt_fa_agent_activity values ('2014-05-20 00:00:00', 1 , 22,1);
    insert into rpt_fa_agent_activity values ('2014-05-20 00:00:00', 1 , 22,1);

    -- This uses GroupByHash
    select date(event_time) date, agent_id, sum(duration_days) duration
    from rpt_fa_agent_activity
    where account_id=1
      and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
    group by date(event_time), agent_id;

    -- This uses GroupByPipe
    select event_time date, agent_id, sum(duration_days) duration
    from rpt_fa_agent_activity
    where account_id=1
      and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00'
    group by event_time, agent_id;



    The use of GroupByPipe is predicated on the sort order, and applying a function may change the sort order.
    Hence, in general it is not correct to use GroupByPipe unless it is known a priori that the output produced by the function is sorted the same way as the underlying projection/input.

    For example, the date function may take a varchar as an argument and return a date type. If the underlying type is varchar, it is lexically (alphabetically) sorted; the date function, however, returns a date type whose values are no longer in sorted order. Hence, as a general rule, GroupByHash is the correct operator when a function is specified in a GROUP BY clause.

    lexically sorted
    Dec 2012
    Jan 2012
    Jan 2013

    sorted as date
    Jan 2012
    Dec 2012
    Jan 2013



    If the reason you are modifying the query is the presentation of the date field, I believe there may be a way:

    skeswani=> select event_time date, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by event_time, agent_id;
            date         | agent_id | duration 
    ---------------------+----------+----------
     2014-01-14 00:00:00 |        1 |       36
     2014-05-20 00:00:00 |        1 |       66
    (2 rows)


    skeswani=> select date(event_time), agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by date(event_time), agent_id;
        date    | agent_id | duration 
    ------------+----------+----------
     2014-05-20 |        1 |       66
     2014-01-14 |        1 |       36
    (2 rows)



    -- This may work for you

    skeswani=> \o | grep Pipe
    skeswani=> explain select date(event_time), agent_id, duration  from ( select event_time, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by 1, 2 ) as A;
    skeswani=>  4[label = "GroupByPipe: 2 keys\nAggs:\n  sum(rpt_fa_agent_activity.duration_days)\nUnc: Integer(8)\nUnc: Timestamp(8)\nUnc: Integer(8)", color = "green", shape = "box"];
                    6[label = "GroupByPipe: 2 keys\nAggs:\n  sum(rpt_fa_agent_activity.duration_days)\nUnc: Integer(8)\nUnc: Timestamp(8)\nUnc: Integer(8)", color = "brown", shape = "box"];
    skeswani=> \o
    skeswani=> select date(event_time), agent_id, duration  from ( select event_time, agent_id, sum(duration_days) duration from rpt_fa_agent_activity where account_id=1 and event_time between '2014-01-14 00:00:00' and '2014-05-21 00:00:00' group by 1, 2 ) as A;
        date    | agent_id | duration 
    ------------+----------+----------
     2014-01-14 |        1 |       36
     2014-05-20 |        1 |       66
    (2 rows)


    -- Alternatively, you may consider using a view to achieve the effect of the subquery above.
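    The lexical-vs-chronological point above can be checked in a few lines of Python (the month strings are from the example; `datetime.strptime` stands in for Vertica's date cast):

    ```python
    from datetime import datetime

    # "Mon YYYY" strings: a varchar projection column would store them
    # in lexical order, which is not chronological order.
    vals = ["Jan 2012", "Dec 2012", "Jan 2013"]

    lexical = sorted(vals)  # how a varchar column is sorted
    chronological = sorted(vals, key=lambda s: datetime.strptime(s, "%b %Y"))

    # The two orders differ, so applying a date cast to a sorted varchar
    # column yields an unsorted key stream -- the engine must hash.
    ```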
  • I loved your example, but I don't see where you are grouping by date(xxx).
  • I don't use "group by date(XXX)", probably because I don't believe you need to group by the output of the function. In this specific case, you can group by the original sorted data; the grouping will be identical. You can then apply the function to the output.

    It attempts to work around the issue.

    Also, GbyHash is the correct operator if the grouping is done by a function, for the reason I mentioned above. I suspect the user believes that GbyPipe offers some performance benefit and wants to coerce the query to use it, and this would enable them to do so.
  • Yes, you are correct; I was not seeing the GROUP BY. Nice approach.
  • On second thought, a group by date(XXX) will be required (thanks for pointing it out).
    [Sorry, my workaround is not really correct.]

    If event_time is a timestamp (i.e. has a higher resolution/granularity than date), then the groups won't be identical.

    I guess GbyHash is required for this exact reason, unless event_time is a varchar and contains only the date.
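    The resolution problem is easy to demonstrate. In this Python sketch (hypothetical timestamps, string slicing standing in for the date cast), grouping by the full timestamp yields three groups while grouping by the date part yields one, so the two groupings are not equivalent:

    ```python
    from itertools import groupby

    # Three events on the same day but at different times.
    ts = ["2014-01-14 08:00:00", "2014-01-14 12:30:00", "2014-01-14 23:59:59"]

    # Group by the full timestamp: every value is distinct -> 3 groups.
    by_timestamp = [k for k, _ in groupby(ts)]

    # Group by the date part only (first 10 chars) -> 1 group.
    by_date = [k for k, _ in groupby(ts, key=lambda t: t[:10])]
    ```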



  • Thank you both.
  • It's best to pull the date out into a separate column during the load process. Queries and aggregations on date are very common. Then you can easily optimize for GROUPBY PIPELINED and have more efficient query predicates.

      --Sharon


  • In 7.1, Vertica provides expression-based projections, which will do what you need (like function-based indexes in Oracle).

    It will be available very soon!
  • We can't add a date column to the table because we need to query by different time zones.
    The event_time column type is timestamptz, and we execute a "set timezone" statement before executing the query.

  • @sumeet_1

     

    I believe Vertica needs to add the concept of an increasing/decreasing monotonic function flag (it already knows the immutable concept) in order to hint the query optimizer whether a function may change the ordering of the underlying columns.

     

    For example, taking the column as-is, without applying a function, is equivalent to the identity function, which is monotonic increasing. In this particular case, applying the DATE function to a TIMESTAMP column is monotonic (as is DATE_TRUNC).

     

    This would be an improvement, since I suspect the first sort column of fact tables is almost always a TIMESTAMP (when timezones are involved).
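    The monotonicity property this comment proposes can be sketched in Python: if a function maps sorted input to sorted output, a pipelined group-by could in principle reuse the projection's sort order. The helper below is a hypothetical illustration, not a Vertica feature:

    ```python
    from datetime import datetime

    def is_monotonic_on(f, sorted_inputs):
        """Check that f preserves (non-strict) ordering on a sorted
        sample. If it does, grouping by f(col) could stream over the
        projection's sort order instead of hashing."""
        outputs = [f(x) for x in sorted_inputs]
        return all(a <= b for a, b in zip(outputs, outputs[1:]))

    # Sorted timestamps spanning two days.
    stamps = [
        datetime(2014, 1, 14, 0),
        datetime(2014, 1, 14, 6),
        datetime(2014, 1, 14, 23),
        datetime(2014, 5, 20, 12),
    ]

    # DATE preserves the order of sorted timestamps...
    date_is_monotonic = is_monotonic_on(lambda t: t.date(), stamps)
    # ...whereas e.g. extracting the hour does not.
    hour_is_monotonic = is_monotonic_on(lambda t: t.hour, stamps)
    ```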
