SALT in Segmentation Clause
In my domain, we have very skewed data, which makes segmenting it across a cluster difficult in a way that prioritizes both disk access and the ability to perform aggregations on a node prior to redistribution across the cluster. In Spark, you might work around this with a SALT: add a random column to distribute the data, include that SALT in local groupings, then drop it in any aggregations that are network-wide.
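To make the pattern concrete, here is a minimal Python sketch of the salt-then-drop idea (not Spark or Vertica code; the row data and salt count are invented for illustration): aggregate locally with the salt in the grouping key, then drop it in the final combine.

```python
import random
from collections import defaultdict

# Hypothetical rows: (skewed_dimension, metric). Key 1 is the hot key.
rows = [(1, 1.0)] * 1000 + [(2, 1.0)] * 10

NUM_SALTS = 4  # salt cardinality; in practice sized relative to the cluster

# Phase 1: tag each row with a random salt and aggregate locally,
# INCLUDING the salt, so the hot key is split into NUM_SALTS partial groups.
partial = defaultdict(float)
for dim, metric in rows:
    salt = random.randrange(NUM_SALTS)
    partial[(dim, salt)] += metric

# Phase 2: drop the salt and combine the partials network-wide.
final = defaultdict(float)
for (dim, _salt), subtotal in partial.items():
    final[dim] += subtotal

assert final[1] == 1000.0 and final[2] == 10.0
```

The key property is that phase 1 never produces more than `NUM_SALTS` partial groups per key, so no single group carries the entire hot key's volume.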
Take this example (assume that other_dim1 and other_dim2 are loosely correlated with skewed_dimension):
```sql
CREATE TABLE Schema.Test (
    utc_time timestamptz NOT NULL,
    skewed_dimension int,
    other_dim1 int,
    other_dim2 int,
    SALT int,
    metric numeric(19,4)
)
PARTITION BY SomeHierarchicalFunction;

CREATE PROJECTION Schema.Test_super (
    utc_time ENCODING RLE,
    skewed_dimension ENCODING RLE,
    other_dim1 ENCODING RLE,
    other_dim2,
    SALT,
    metric
) AS
SELECT utc_time, skewed_dimension, other_dim1, other_dim2, SALT, metric
FROM Schema.Test
ORDER BY utc_time, skewed_dimension, other_dim1, other_dim2, SALT -- (maybe?)
SEGMENTED BY HASH(skewed_dimension, SALT) ALL NODES;
```
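The point of putting SALT into the segmentation hash is fan-out: a single hot `skewed_dimension` value no longer maps to one node. A small Python sketch of the effect, where `node_for` is a deterministic stand-in for Vertica's HASH() (an assumption for illustration, not the real function):

```python
import hashlib

def node_for(*cols, num_nodes=8):
    # Deterministic stand-in for HASH(...) % node count (illustrative only).
    digest = hashlib.md5(repr(cols).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

NUM_SALTS = 8
hot_key = 42  # a single hot skewed_dimension value

# Without salt: every row for the hot key hashes to exactly one node.
nodes_without_salt = {node_for(hot_key)}

# With salt: the hot key fans out across up to NUM_SALTS nodes.
nodes_with_salt = {node_for(hot_key, s) for s in range(NUM_SALTS)}

assert len(nodes_without_salt) == 1
assert len(nodes_with_salt) > 1
```

The trade-off is that any query grouping only by `skewed_dimension` now needs a cross-node combine step, which is exactly what the nested query below handles.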
Then, for some query where you group by date (not hour!), you might do this:
```sql
SELECT date, skewed_dimension, other_dim2, sum(metric) AS metric
FROM (
    SELECT date(utc_time) AS date,
           skewed_dimension,
           other_dim2,
           sum(metric) AS metric
    FROM Schema.Test
    WHERE utc_time BETWEEN '2022-01-01 00:00:00' AND '2022-02-01 00:00:00'
      AND skewed_dimension IN (1,2,3,4,5,6,7)
    GROUP BY date(utc_time), skewed_dimension, other_dim2
) AS local_agg
GROUP BY date, skewed_dimension, other_dim2;
```
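The correctness argument for the nested query can be sketched in Python: rows for one logical key are spread across nodes by HASH(dim, salt), each node computes local partial sums (the inner query), and merging the partials (the outer query) recovers the full totals. All names and numbers here are invented for illustration:

```python
import random
from collections import defaultdict

random.seed(0)
NUM_NODES, NUM_SALTS = 4, 4

# Hypothetical rows: (day, skewed_dimension, other_dim2, metric); key 1 is hot.
rows = [("2022-01-01", 1, 10, 1.0)] * 100 + [("2022-01-01", 2, 20, 1.0)] * 5

# Place each row on the node HASH(skewed_dimension, SALT) would pick,
# then aggregate locally per node (the inner query).
per_node = defaultdict(lambda: defaultdict(float))
for day, dim, d2, m in rows:
    salt = random.randrange(NUM_SALTS)
    node = hash((dim, salt)) % NUM_NODES
    per_node[node][(day, dim, d2)] += m

# Merge each node's partials after resegmentation (the outer query).
final = defaultdict(float)
for partials in per_node.values():
    for key, subtotal in partials.items():
        final[key] += subtotal

assert final[("2022-01-01", 1, 10)] == 100.0
assert final[("2022-01-01", 2, 20)] == 5.0
```

Because sum is associative, the two-phase result matches a single global GROUP BY regardless of how the salt scattered the hot key.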
My understanding is that, given a SALT of sufficiently high cardinality, this should perform quite well: data would be filtered effectively, and aggregation could be done locally on each node prior to resegmentation.
Is this a strategy that people are using for data whose segmentation schemes are relatively high value, balancing both disk access and leveraging the nodes for compute prior to resegmentation? What alternative approaches to this balance are there?