SALT in Segmentation Clause
In my domain, we have very skewed data, which makes segmentation across a cluster difficult in a way that priortizes both disk acess and the ability to perform aggregations on a node prior to redistribution across the cluster. In Spark, you might use a SALT to work around this, where you add some random column to distribute data, include that SALT in local groupings, then drop it in any aggregations that are network wide.
Take this example (assume that other_dim1 and other_dim2 are loosely correlated to skewed_dimension)
CREATE TABLE Schema.Test ( utc_time timestamptz NOT NULL, skewed_dimension int, other_dim1 int, other_dim2 int, SALT int, metric numeric(19,4) ) PARTITION BY SomeHeirchalFunction CREATE PROJECTION SCHEMA.Test_super( utc_time RLE, skewed_dimension RLE, other_dim1 RLE, other_dim2, SALT, metric ) AS SELECT utc_time, skewed_dimension, other_dim1, other_dim2, SALT, metric FROM Schema.Test ORDER BY utc_time, skewed_dimension, other_dim1, other_dim2, SALT (maybe?) SEGMENTED BY HASH(skewed_dimension,SALT)
Then, for some query where you group by date (not hour!), you might do this
SELECT date, skewed_dimension, other_dim2 sum(metric) as metric FROM ( SELECT date(utc_time) as date, skewed_dimension, other_dim2, sum(metric) as metric FROM Schema.Test WHERE utc_time BETWEEN '2022-01-01 00:00:00' AND '2022-02-01 00:00:00' AND skewed_dimension IN (1,2,3,4,5,6,7) GROUP BY date(utc_time), skewed_dimension, other_dim2 ) GROUP BY date, skewed_dimension, other_dim2
My understanding is that given a sufficiently large SALT, this should perform quite well, where data would be filtered effectively, and aggregation could be done locally on a node prior to resegmentation.
Is this a strategy that people are using for data have segmentation schemes that are relatively high value, balancing both disk access and leveraging the nodes for compute prior to resegmentation? What alternative approaches to this balance are there?
Just for the sake of preciseness, beware the use of partitioning as it has a different meaning that segmenting.
The later one is for data distribution across nodes - its goal is to ensure proper parallelism - the former one is to group data in different ROS inside each node based on the partition column expression, its goal is to reduce I/O when dealing with data.
I found this query here:
select floor(hash(utc_time, skewed_dimension)/((9223372036854775807/(select count(*) from nodes)) + 1)) as node , count(*) as row_count from Test group by 1 order by 1;
But it seems more an estimate than a real value, they must have changed the segmentation formula in an intermediate version.
I'll try to find something more accurate
For your second point, given your inputs timestamtz is the proper choice, no discussion here.0
Have you tried loading a sample of data and running Database Designer using that sample table and one or more queries as samples? Low cardinality columns may not impact distribution as much as you think, unless they're the only hash column. It might suffice to have the skewed column first to ensure each value is colocated if that makes sense, but add other dimension columns to increase the hash range. You can see the segmentation in storage_locations with a query like:
select distinct projection_name, node_name, segment_lower_bound, segment_upper_bound from storage_containers where projection_name = order by 1,2 limit 10;
Internally, we've developed and run an air traffic monitoring demo that collects ADS-B data and displays a number of metrics. In this case, there are several fields that have low cardinality like boolean and enumerated type fields. A projection segmentation combining these in order where they're typically used for select and join has proved usable. Also, a Live Aggregate Projection is another useful way to pre-compute aggregates in the background so the impact of skewed data is less visible as it's resolved within the LAP: https://www.vertica.com/docs/11.1.x/HTML/Content/Authoring/AnalyzingData/AggregatedData/LiveAggregateProjectionExample.htm
I was giving some thoughts around this use case, can't the utc_time column be serving as salt?
It should give proper distribution accross the nodes, maybe you can use just a part of the utc_time column like the hour:
Oh btw, if the column utc_time is at time 'UTC' maybe you don't need the timestamp with time zone format, timestamp should be enough (and the DDL is easier):
For the record I did try those commands on a 11.0 version.
Hi @VValdar ,
Yes, I think that would work. The full context of this is that we currently partition by 5+ columns, and I thought to use 'salt' as a replacement for some of those columns. But utc_time is well distributed enough that I could just simplify to utc_time and skewed_dimension.
Two followup questions:
Would this be a decent query for estimating data distribution on a 10 node query if I were partitioning on utc_time and skewed_dimension?
Second, Could you explain more about the benefit of the timestamp vs timestamptz? Our use case is that we
date(utc_time at time zone 'America/New_York')
Would storing data in timestamps be more performant or offer other benefits?
Thanks for your help!