
SALT in Segmentation Clause

Hi All,
In my domain, we have very skewed data, which makes it difficult to choose a segmentation across the cluster that prioritizes both disk access and the ability to aggregate on a node prior to redistribution across the cluster. In Spark, you might use a SALT to work around this: add a random column to distribute the data, include that SALT in local groupings, then drop it in any aggregations that are network-wide.

Take this example (assume that other_dim1 and other_dim2 are loosely correlated with skewed_dimension):

CREATE TABLE Schema.Test (
  utc_time timestamptz NOT NULL,
  skewed_dimension int,
  other_dim1 int,
  other_dim2 int,
  SALT int,
  metric numeric(19,4)
)
PARTITION BY SomeHierarchicalFunction;
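One detail the example glosses over is how SALT gets populated. A minimal sketch, assuming a hypothetical staging table Schema.Staging_Test and an arbitrary salt cardinality of 16 (tune that to your node count and skew):

-- Populate SALT at load time. RANDOMINT(16) gives a uniform integer in [0, 16);
-- the HASH(...) % 16 variant is deterministic per (other_dim1, other_dim2)
-- combination, which can matter for repeatable loads.
INSERT INTO Schema.Test (utc_time, skewed_dimension, other_dim1, other_dim2, SALT, metric)
SELECT
  utc_time,
  skewed_dimension,
  other_dim1,
  other_dim2,
  RANDOMINT(16),              -- or: HASH(other_dim1, other_dim2) % 16
  metric
FROM Schema.Staging_Test;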

CREATE PROJECTION Schema.Test_super (
  utc_time ENCODING RLE,
  skewed_dimension ENCODING RLE,
  other_dim1 ENCODING RLE,
  other_dim2,
  SALT,
  metric
) AS
SELECT
  utc_time,
  skewed_dimension,
  other_dim1,
  other_dim2,
  SALT,
  metric
FROM Schema.Test
ORDER BY
  utc_time,
  skewed_dimension,
  other_dim1,
  other_dim2,
  SALT              -- (maybe?)
SEGMENTED BY HASH(skewed_dimension, SALT) ALL NODES;
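As a sanity check on whether the salt actually evens out the segmentation, something like the following against the projection_storage system table should show roughly equal row counts per node (a sketch; the ILIKE pattern assumes the default buddy-projection naming):

-- Rows per node for the salted projection; large imbalances suggest the salt
-- cardinality or the hash columns need rethinking.
SELECT node_name, SUM(row_count) AS rows_on_node
FROM v_monitor.projection_storage
WHERE projection_name ILIKE 'Test_super%'
GROUP BY node_name
ORDER BY node_name;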

Then, for some query where you group by date (not hour!), you might do this:

SELECT
  date,
  skewed_dimension,
  other_dim2,
  sum(metric) AS metric
FROM (
  SELECT
    date(utc_time) AS date,
    skewed_dimension,
    other_dim2,
    sum(metric) AS metric
  FROM Schema.Test
  WHERE
    utc_time BETWEEN '2022-01-01 00:00:00' AND '2022-02-01 00:00:00'
    AND skewed_dimension IN (1,2,3,4,5,6,7)
  GROUP BY
    date(utc_time),
    skewed_dimension,
    other_dim2
) local_agg
GROUP BY
  date,
  skewed_dimension,
  other_dim2;

My understanding is that given a sufficiently high-cardinality SALT, this should perform quite well: data would be filtered effectively, and aggregation could be done locally on each node prior to resegmentation.
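One way to check that assumption on a real cluster (just a sketch, reading the plan by eye): run the rollup under EXPLAIN and confirm the inner GROUPBY appears before the RESEGMENT annotation, so only pre-aggregated rows cross the network.

-- The hope is to see the inner GROUPBY happen per node, with RESEGMENT applied
-- only to the already-reduced rows feeding the outer GROUP BY.
EXPLAIN
SELECT date, skewed_dimension, other_dim2, sum(metric) AS metric
FROM (
  SELECT date(utc_time) AS date, skewed_dimension, other_dim2, sum(metric) AS metric
  FROM Schema.Test
  WHERE utc_time BETWEEN '2022-01-01 00:00:00' AND '2022-02-01 00:00:00'
    AND skewed_dimension IN (1,2,3,4,5,6,7)
  GROUP BY date(utc_time), skewed_dimension, other_dim2
) local_agg
GROUP BY date, skewed_dimension, other_dim2;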

Is this a strategy that people are using when the choice of segmentation scheme is relatively high-value, balancing both disk access and leveraging the nodes for compute prior to resegmentation? What alternative approaches to this balance are there?

Answers

  • Bryan_H (Administrator)

    Have you tried loading a sample of the data and running Database Designer with that sample table and one or more sample queries? Low-cardinality columns may not impact distribution as much as you think, unless they're the only hash column. It might suffice to put the skewed column first so that each value is colocated, if that makes sense, and add other dimension columns to increase the hash range. You can see the segmentation in storage_containers with a query like:
    select distinct projection_name, node_name, segment_lower_bound, segment_upper_bound from storage_containers where projection_name = '<projection-name>' order by 1,2 limit 10;
    Internally, we've developed and run an air traffic monitoring demo that collects ADS-B data and displays a number of metrics. In this case, there are several fields that have low cardinality like boolean and enumerated type fields. A projection segmentation combining these in order where they're typically used for select and join has proved usable. Also, a Live Aggregate Projection is another useful way to pre-compute aggregates in the background so the impact of skewed data is less visible as it's resolved within the LAP: https://www.vertica.com/docs/11.1.x/HTML/Content/Authoring/AnalyzingData/AggregatedData/LiveAggregateProjectionExample.htm
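    For reference, a live aggregate projection along those lines, built against the table from the question, might look roughly like this (Test_daily_agg is just an illustrative name; a projection created on an already-loaded table still needs a refresh before queries can use it):

    -- Pre-aggregate as data is loaded, so the skew is absorbed in the background
    -- rather than at query time; the date rollup then reads the LAP.
    CREATE PROJECTION Schema.Test_daily_agg AS
    SELECT
      utc_time::DATE AS utc_date,
      skewed_dimension,
      other_dim2,
      SUM(metric) AS metric_sum
    FROM Schema.Test
    GROUP BY utc_time::DATE, skewed_dimension, other_dim2;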
