SALT in Segmentation Clause

mferrall · June 2022

Hi All,
In my domain, we have very skewed data, which makes segmentation across a cluster difficult in a way that priortizes both disk acess and the ability to perform aggregations on a node prior to redistribution across the cluster. In Spark, you might use a SALT to work around this, where you add some random column to distribute data, include that SALT in local groupings, then drop it in any aggregations that are network wide.

Take this example (assume that other_dim1 and other_dim2 are loosely correlated to skewed_dimension)

CREATE TABLE Schema.Test (
  utc_time timestamptz NOT NULL,
  skewed_dimension int,
  other_dim1 int,
  other_dim2 int,
  SALT int,
  metric numeric(19,4)
)
PARTITION BY SomeHeirchalFunction

CREATE PROJECTION SCHEMA.Test_super(
  utc_time RLE,
  skewed_dimension RLE,
  other_dim1 RLE,
  other_dim2,
  SALT,
  metric
) AS
SELECT 
  utc_time,
  skewed_dimension,
  other_dim1,
  other_dim2,
  SALT,
  metric
FROM Schema.Test
ORDER BY
  utc_time,
  skewed_dimension,
  other_dim1,
  other_dim2,
  SALT (maybe?)
SEGMENTED BY HASH(skewed_dimension,SALT)

Then, for some query where you group by date (not hour!), you might do this

SELECT
  date,
  skewed_dimension,
  other_dim2
  sum(metric) as metric
FROM (
  SELECT
    date(utc_time) as date,
    skewed_dimension,
    other_dim2,
    sum(metric) as metric
  FROM Schema.Test
  WHERE 
    utc_time BETWEEN '2022-01-01 00:00:00' AND '2022-02-01 00:00:00'
    AND skewed_dimension IN (1,2,3,4,5,6,7)
  GROUP BY 
    date(utc_time),
    skewed_dimension,
    other_dim2
)
GROUP BY
  date,
  skewed_dimension,
  other_dim2

My understanding is that given a sufficiently large SALT, this should perform quite well, where data would be filtered effectively, and aggregation could be done locally on a node prior to resegmentation.

Is this a strategy that people are using for data have segmentation schemes that are relatively high value, balancing both disk access and leveraging the nodes for compute prior to resegmentation? What alternative approaches to this balance are there?

VValdar · August 2022

Just for the sake of preciseness, beware the use of partitioning as it has a different meaning that segmenting.
The later one is for data distribution across nodes - its goal is to ensure proper parallelism - the former one is to group data in different ROS inside each node based on the partition column expression, its goal is to reduce I/O when dealing with data.

I found this query here:

  select floor(hash(utc_time, skewed_dimension)/((9223372036854775807/(select count(*) from nodes)) + 1)) as node
       , count(*) as row_count
    from Test
group by 1
order by 1;

But it seems more an estimate than a real value, they must have changed the segmentation formula in an intermediate version.
I'll try to find something more accurate

For your second point, given your inputs timestamtz is the proper choice, no discussion here.

Bryan_H · June 2022

Have you tried loading a sample of data and running Database Designer using that sample table and one or more queries as samples? Low cardinality columns may not impact distribution as much as you think, unless they're the only hash column. It might suffice to have the skewed column first to ensure each value is colocated if that makes sense, but add other dimension columns to increase the hash range. You can see the segmentation in storage_locations with a query like:
select distinct projection_name, node_name, segment_lower_bound, segment_upper_bound from storage_containers where projection_name = order by 1,2 limit 10;
Internally, we've developed and run an air traffic monitoring demo that collects ADS-B data and displays a number of metrics. In this case, there are several fields that have low cardinality like boolean and enumerated type fields. A projection segmentation combining these in order where they're typically used for select and join has proved usable. Also, a Live Aggregate Projection is another useful way to pre-compute aggregates in the background so the impact of skewed data is less visible as it's resolved within the LAP: https://www.vertica.com/docs/11.1.x/HTML/Content/Authoring/AnalyzingData/AggregatedData/LiveAggregateProjectionExample.htm

VValdar · August 2022

Hi mferrall,

I was giving some thoughts around this use case, can't the utc_time column be serving as salt?
It should give proper distribution accross the nodes, maybe you can use just a part of the utc_time column like the hour:

create table Test
( utc_time          timestamp(0) with time zone NOT null
, skewed_dimension  int
, other_dim1        int
, other_dim2        int
, metric            numeric(19,4)
)
order by skewed_dimension, utc_time
segmented by hash(skewed_dimension, extract(hour from utc_time at time zone 'UTC')) all nodes
partition by cast(utc_time at time zone 'UTC' as date);

Oh btw, if the column utc_time is at time 'UTC' maybe you don't need the timestamp with time zone format, timestamp should be enough (and the DDL is easier):

create table Test_fco
( utc_time          timestamp(0) NOT null
, skewed_dimension  int
, other_dim1        int
, other_dim2        int
, metric            numeric(19,4)
)
order by skewed_dimension, utc_time
segmented by hash(skewed_dimension, extract(hour from utc_time)) all nodes
partition by utc_time::date;

For the record I did try those commands on a 11.0 version.

mferrall · August 2022

Hi @VValdar ,
Yes, I think that would work. The full context of this is that we currently partition by 5+ columns, and I thought to use 'salt' as a replacement for some of those columns. But utc_time is well distributed enough that I could just simplify to utc_time and skewed_dimension.

Two followup questions:

Would this be a decent query for estimating data distribution on a 10 node query if I were partitioning on utc_time and skewed_dimension?

SELECT
  count(*),
  MOD(HASH(utc_time, skewed_dimension), 10)
FROM
  Test
GROUP BY
  MOD(HASH(utc_time, skewed_dimension), 10)
ORDER BY
  1 DESC;

Second, Could you explain more about the benefit of the timestamp vs timestamptz? Our use case is that we

Store data on an hourly grain (values in our utc_column look like '2022-01-01 00:00:00', '2022-01-01 01: 00:00' etc)
Always store in utc time
We always aggregate at the day level
We have clients in many time zones and present daily aggregates in their time zone. So we group by something like date(utc_time at time zone 'America/New_York')

Would storing data in timestamps be more performant or offer other benefits?

Thanks for your help!

We're Moving!

Create My New Community Account Now

SALT in Segmentation Clause

Best Answer

Answers

Leave a Comment