SALT in Segmentation Clause
Hi All,
In my domain, we have very skewed data, which makes it difficult to choose a segmentation scheme across the cluster that prioritizes both disk access and the ability to aggregate on each node before redistributing data across the cluster. In Spark, you might work around this with a SALT: add a random column to distribute the data, include that SALT in local groupings, then drop it from any aggregations that are cluster-wide.
Take this example (assume that other_dim1 and other_dim2 are loosely correlated to skewed_dimension)
CREATE TABLE Schema.Test (
    utc_time timestamptz NOT NULL,
    skewed_dimension int,
    other_dim1 int,
    other_dim2 int,
    SALT int,
    metric numeric(19,4)
)
PARTITION BY SomeHierarchicalFunction;

CREATE PROJECTION Schema.Test_super (
    utc_time ENCODING RLE,
    skewed_dimension ENCODING RLE,
    other_dim1 ENCODING RLE,
    other_dim2,
    SALT,
    metric
)
AS SELECT utc_time, skewed_dimension, other_dim1, other_dim2, SALT, metric
FROM Schema.Test
ORDER BY utc_time, skewed_dimension, other_dim1, other_dim2, SALT  -- (maybe?)
SEGMENTED BY HASH(skewed_dimension, SALT) ALL NODES;
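For what it's worth, I'd populate the SALT at load time rather than carrying it in the source data. Something like this, where Schema.Staging_Test is just a stand-in for wherever the raw rows come from and RANDOMINT(16) gives 16 salt buckets:

-- Hypothetical load step: derive the SALT during the insert.
-- RANDOMINT(16) returns a uniform integer in [0, 15], so rows for any one
-- skewed_dimension value get spread across up to 16 segmentation buckets.
INSERT INTO Schema.Test (utc_time, skewed_dimension, other_dim1, other_dim2, SALT, metric)
SELECT utc_time,
       skewed_dimension,
       other_dim1,
       other_dim2,
       RANDOMINT(16) AS SALT,
       metric
FROM Schema.Staging_Test;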
Then, for some query where you group by date (not hour!), you might do this:
SELECT date, skewed_dimension, other_dim2, sum(metric) AS metric
FROM (
    SELECT date(utc_time) AS date, skewed_dimension, other_dim2, sum(metric) AS metric
    FROM Schema.Test
    WHERE utc_time BETWEEN '2022-01-01 00:00:00' AND '2022-02-01 00:00:00'
      AND skewed_dimension IN (1,2,3,4,5,6,7)
    GROUP BY date(utc_time), skewed_dimension, other_dim2
) AS daily
GROUP BY date, skewed_dimension, other_dim2;
My understanding is that, given a SALT with enough distinct values, this should perform quite well: data would be filtered effectively, and aggregation could be done locally on each node prior to resegmentation.
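As a sanity check on whether the salt actually evens things out, I'd expect a query like this against projection_storage to show roughly equal row counts per node (the projection name filter is just a guess at what Vertica creates, so adjust it):

SELECT node_name, SUM(row_count) AS rows_on_node
FROM v_monitor.projection_storage
WHERE projection_name ILIKE 'Test_super%'
GROUP BY node_name
ORDER BY node_name;

And running EXPLAIN on the query above should show the inner GROUP BY happening on each node before the RESEGMENT step, if the plan works out the way I'm hoping.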
Is this a strategy that people are using for data where getting the segmentation scheme right is high value, balancing both disk access and leveraging the nodes for compute prior to resegmentation? What alternative approaches to this balance are there?
Answers
Have you tried loading a sample of the data and running Database Designer with that sample table and one or more representative queries? Low-cardinality columns may not hurt distribution as much as you think, unless one of them is the only hash column. It might suffice to put the skewed column first, so each value is colocated if that matters for your queries, and then add other dimension columns to increase the hash range (see the sketch after the query below). You can see the segmentation boundaries in storage_containers with a query like:
select distinct projection_name, node_name, segment_lower_bound, segment_upper_bound
from storage_containers
where projection_name = '<your_projection_name>'
order by 1, 2
limit 10;
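For example, something along these lines keeps the skewed column first but widens the hash with the loosely correlated dimensions instead of a synthetic SALT. This is only a sketch against your table definition above, and Test_dim_hash is a made-up projection name:

CREATE PROJECTION Schema.Test_dim_hash (
    utc_time ENCODING RLE,
    skewed_dimension ENCODING RLE,
    other_dim1 ENCODING RLE,
    other_dim2,
    metric
)
AS SELECT utc_time, skewed_dimension, other_dim1, other_dim2, metric
FROM Schema.Test
ORDER BY utc_time, skewed_dimension, other_dim1, other_dim2
SEGMENTED BY HASH(skewed_dimension, other_dim1, other_dim2) ALL NODES;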
Internally, we've developed and run an air traffic monitoring demo that collects ADS-B data and displays a number of metrics. In this case, there are several fields with low cardinality, such as boolean and enumerated-type fields. A projection segmentation that combines them, in the order they're typically used for SELECT and JOIN, has proved usable. A Live Aggregate Projection is another useful way to pre-compute aggregates in the background, so the impact of skewed data is less visible because it's resolved within the LAP: https://www.vertica.com/docs/11.1.x/HTML/Content/Authoring/AnalyzingData/AggregatedData/LiveAggregateProjectionExample.htm
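As a rough sketch of the LAP idea against your table (the projection name and the daily grain are my assumptions; check the linked example for the exact syntax and restrictions in your version):

-- Hypothetical live aggregate projection that pre-computes the daily sums,
-- so the skew is absorbed when data is loaded rather than at query time.
CREATE PROJECTION Schema.Test_daily_lap AS
SELECT utc_time::DATE AS utc_date,
       skewed_dimension,
       other_dim2,
       SUM(metric) AS total_metric
FROM Schema.Test
GROUP BY utc_time::DATE, skewed_dimension, other_dim2;

If the anchor table already has data, you'd refresh the projection afterwards (for example with START_REFRESH()), and keep in mind that live aggregate projections restrict some DML on the anchor table.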