Live Aggregate Projection Limitations

mferrallmferrall Vertica Customer
edited May 2024 in General Discussion

I have a table with the following schema

utc_time timestamptz, -- this is always stored at the hourly grain in UTC
dim1 int,
dim2 varchar(150),
metric int

Data notes

  • Dim 2 is very high cardinality

Use cases

  • Frequently summarizing data for Dim1 by day at arbitrary time zones
  • Less often, drilling down into data at the day > dim1 > dim 2 level

I created a Live Aggregate Projection, as it seemed like it'd serve use cases 1 and 2 pretty well. Its defined

select
utc_time,
dim1
sum(metric) as metric
group by utc_time, dim1

However, when I query it like below, Vertica says it is infeasible to use the LAP (tested by providing a hint). Use of the data looks like

select date(utc_time at ?timezone) as day,
dim1,
sum(metric) as total
group by 1, 2

Question

  • Is Vertica able to use Live Aggregate Projections when you group by an expression like aggregating at a date level?
  • If it is, what am I doing wrong?
  • If it can not, would you model the data differently? What other tooling does Vertica offer to support a use case like this?

Answers

  • I'm wondering if the timezone designation is throwing it off. Timezone isn't immutable, so it could be problematic here.

    Is it possible to store the data with the timezone so that you don't have to do the conversion during the query?

  • mferrallmferrall Vertica Customer
    edited May 2024

    I just ran a quick test to see if timestamptz is whats throwing it off, converting it to timestamp. I found the same behavior, when I group by utc_time, dim1, it uses the LAP, when I group by date(utc_time),dim1, it does not. The time is the first column in the LAP group by. When I provide a hint that it should use the LAP, it says its infeasible. When I specify the LAP directly (in the from, like a table) the query runs with either grouping.

    To the second part of your comment, are you suggesting that I store date and timezone as separate columns, so it'd be a matter of filtering to pre-derived dates for a time zone of interest? That does make sense, we just have a huge number of timezones that we support. That might be our next step, I was just hoping for something more flexible, and was hoping I had more tooling to support this. If you (or others) have other thoughts on how to test this further, I'm all ears!

  • It's a little hard to debug over a forum like this. I'd need to dig deeper to figure out what's going on here, but I would suggest you open a support case to ensure there's nothing going on behind that could be causing it not to behave as you expect.

  • mferrallmferrall Vertica Customer

    Thank you for your input, I'll look into opening a ticket

  • marcothesanemarcothesane - Select Field - Administrator

    If the query groups by something else than the grouping expression of the Live Aggregate Projection, then the LAP won't be used.

    The LAP groups by:
    utc_time
    , dim1

    The query groups by:
    date(utc_time at ?timezone)
    , dim1

    The optimiser has no reason to believe that the data in the LAP is partially pre-aggregated to satisfy the query.

  • mferrallmferrall Vertica Customer

    @marcothesane This remains true even when the field I've aggregated away, dim2 in this example, is very high cardinality? This isn't intuitive to me, but I'll think about it more.

    Beyond that, does Vertica offer any tooling to improve performance for a use case like I've described, repeated below

    • Serving data at arbitrary time zones at a daily aggregate (roughly requires storing at hourly grain)
    • Pre-aggregating to reduce the impact of high cardinality fields for some use cases

Leave a Comment

BoldItalicStrikethroughOrdered listUnordered list
Emoji
Image
Align leftAlign centerAlign rightToggle HTML viewToggle full pageToggle lights
Drop image/file