How to design those tables and rewrite this query?

I have this question:

For 2 tables:
1. events:
event_id int (autoincrement) --10B distinct values
event_ts datetime -- 10B
event_type int (1 = impression, 2 = click, 3 = purchase...) --20
product_id int --100K
client_id int --10M
client_type int --10
event_platform int --100

2. products:
product_id int
product_name varchar(20)
product_type int -- 10000

I need to run this query:
select product_type, client_type, date_trunc('month', event_ts), count(distinct client_id), count(*)
from events join products using(product_id)
where event_ts > ...
and product_type in (100 values)
and event_platform in (3 values)
and event_type in (2,3)
group by product_type, client_type, date_trunc('month', event_ts)
having count(*) > 10000

What would be the optimal design for the 2 tables? (order by / partition / sharding)
How would you rewrite this query?

Answers

  • ppcagencyppcagency Community Edition User

How would I write the query for the same data? I am a new member and need to work on the basics. Kindly help. :'(

  • Bryan_HBryan_H Vertica Employee Administrator

    Encoding: Set all fields to ZSTD_FAST_COMP to improve load and query times.
Segmentation: the JOIN key is product_id, so both tables should be segmented by product_id so that matching values are colocated. If one table is much smaller, e.g. if products is used as a dimension table smaller than 10M rows or so, it should be unsegmented.
    Partition: events contains a timestamp, so it's recommended to partition by DATE(event_ts) to improve scan performance by splitting days into separate containers.
    Order and group: it may help to order or group by low cardinality fields to improve WHERE clause performance. For example, if event_type has few values, ORDER BY event_type may improve performance of queries with WHERE event_type in (few values).

    You can run an EXPLAIN on any query to show execution steps and identify possible bottlenecks.

    There's a detailed description of query optimization for many scenarios at https://www.vertica.com/docs/12.0.x/HTML/Content/Authoring/AnalyzingData/Optimizations/OptimizingQueryPerformance.htm
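Putting those recommendations together, a sketch of the DDL might look like this (column types are taken from the question; the specific ORDER BY column choices and the INT type for event_platform are assumptions to tune against your own data and queries):

```sql
-- events: large fact table, segmented on the join key product_id
CREATE TABLE events (
    event_id       INT ENCODING ZSTD_FAST_COMP,
    event_ts       DATETIME ENCODING ZSTD_FAST_COMP,
    event_type     INT ENCODING ZSTD_FAST_COMP,
    product_id     INT ENCODING ZSTD_FAST_COMP,
    client_id      INT ENCODING ZSTD_FAST_COMP,
    client_type    INT ENCODING ZSTD_FAST_COMP,
    event_platform INT ENCODING ZSTD_FAST_COMP
)
ORDER BY event_type, event_platform, event_ts  -- low-cardinality predicate columns first
SEGMENTED BY HASH(product_id) ALL NODES
PARTITION BY DATE(event_ts);

-- products: small dimension table, replicated (unsegmented) on every node
CREATE TABLE products (
    product_id   INT,
    product_name VARCHAR(20),
    product_type INT
)
ORDER BY product_type, product_id
UNSEGMENTED ALL NODES;
```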

  • YossiDroriYossiDrori Vertica Customer

    Thanks.
If I make the products table unsegmented:

    • Is it still relevant to segment the events table by product_id?
    • What about the **group by** clause?

    I've rewritten the query like this, so that execution starts with the events table and does the group by on it first:
    select sum(cnt), product_type, client_type, m
    from (
    select product_id, client_type, date_trunc('month', event_ts) as m,
    count(distinct client_id) as dc, count(*) as cnt
    from events
    where event_ts > ...
    and event_platform in (3 values)
    and event_type in (2,3)
    group by product_id, client_type, date_trunc('month', event_ts)
    ) e
    join (select product_id, product_type
    from products
    where product_type in (100 values)) p using(product_id)
    group by p.product_type, e.client_type, e.m
    having sum(cnt) > 10000

  • Bryan_HBryan_H Vertica Employee Administrator

    For segmentation: you should always segment fact tables on join keys even if the dimension table is unsegmented. Segmenting the fact table helps distribute the data evenly across the cluster in order to improve distributed compute performance with large tables. Please see detail at https://www.vertica.com/docs/12.0.x/HTML/Content/Authoring/AdministratorsGuide/Partitions/PartitioningAndSegmentingData.htm
    For best performance, all GROUP BY columns should be listed in the projection's ORDER BY clause. Please see details at https://www.vertica.com/docs/12.0.x/HTML/Content/Authoring/AnalyzingData/SQLAnalytics/AvoidingGROUPBYHASHWithProjectionDesign.htm
    Since the datetime column "event_ts" is used as a predicate, it will help performance to partition the events table on DATE(event_ts), as noted earlier.
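A query-specific projection along those lines might look like this (the projection name and column order are illustrative, not prescriptive; verify the plan with EXPLAIN):

```sql
-- Illustrative projection for the rewritten query: the ORDER BY
-- covers the GROUP BY columns, which may let the optimizer use
-- GROUPBY PIPELINED instead of GROUPBY HASH.
CREATE PROJECTION events_agg_p AS
SELECT product_id, client_type, event_ts,
       event_type, event_platform, client_id
FROM events
ORDER BY product_id, client_type, event_ts
SEGMENTED BY HASH(product_id) ALL NODES;
```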

  • VValdarVValdar Vertica Employee Employee

    Hi YossiDrori,

    In addition to Bryan_H's answers, I would suggest you consider using a flattened table - it will slow load times a bit, but you'll avoid the join in your query.

    Partitioning would depend on how much data you have per day; month is probably an OK level to start with, but it really depends on the kind of queries you'll run. You should also check the CALENDAR_HIERARCHY_DAY feature; it could make sense for you.

    You could also make a query-optimized projection for your query, and if you query mostly recent data, make it a partition range projection.
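A sketch of the flattened-table idea combined with CALENDAR_HIERARCHY_DAY (the table name and the 2-month/2-year active-partition counts are assumptions; product_type is denormalized into the fact table via a DEFAULT subquery so the query needs no join):

```sql
-- Flattened design: product_type is copied from products at load
-- time through a DEFAULT expression, removing the join at query time.
CREATE TABLE events_flat (
    event_ts       DATETIME,
    event_type     INT,
    product_id     INT,
    product_type   INT DEFAULT (
        SELECT product_type FROM products
        WHERE products.product_id = events_flat.product_id),
    client_id      INT,
    client_type    INT,
    event_platform INT
)
-- Daily partitions for recent data, rolled up into monthly and
-- yearly groups for older data via the calendar hierarchy.
PARTITION BY DATE(event_ts)
    GROUP BY CALENDAR_HIERARCHY_DAY(DATE(event_ts), 2, 2);
```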
