Determining Candidate Segmentation Keys

[Deleted User] · July 2018

Jim Knicely authored this tip.

Hash segmentation allows you to segment a projection based on a built-in hash function that provides even distribution of data across multiple nodes, resulting in optimal query execution. In a projection, the data to be hashed consists of one or more column values, each having a large number of unique values and an acceptable amount of skew in the value distribution. Primary key columns that meet the criteria could be an excellent choice for hash segmentation.

What if you don’t have a primary key?

You can easily determine a great candidate segmentation key with a simple aggregate query on the table.

SELECT FLOOR(HASH(Candidate-Keys-Separated-by-Comma)/((9223372036854775807/Cluster-Node-Count) + 1)) AS Node,
       COUNT(*) AS row_count
  FROM Table-Schema.Table-Name
GROUP BY 1
ORDER BY 1;

Example:

dbadmin=> SELECT FLOOR(HASH(DateKey, AccountKey)/((9223372036854775807/3) + 1)) AS Node,
dbadmin->        COUNT(*) AS row_count
dbadmin->   FROM public.BankAcct_Fact
dbadmin->  GROUP BY 1
dbadmin->  ORDER BY 1;
 Node | row_count
------+-----------
    0 | 135143741
    1 | 135133727
    2 | 135187445
(3 rows)

Because the row counts are evenly distributed across the cluster, the DateKey and AccountKey columns hashed together make a great candidate segmentation key.

Have fun!

We're Moving!

Create My New Community Account Now

Determining Candidate Segmentation Keys