Determining Candidate Segmentation Keys
Jim Knicely authored this tip.
Hash segmentation allows you to segment a projection based on a built-in hash function that provides even distribution of data across multiple nodes, resulting in optimal query execution. In a projection, the data to be hashed consists of one or more column values, each having a large number of unique values and an acceptable amount of skew in the value distribution. Primary key columns that meet the criteria could be an excellent choice for hash segmentation.
What if you don’t have a primary key?
You can easily determine a great candidate segmentation key with a simple aggregate query on the table.
SELECT FLOOR(HASH(Candidate-Keys-Separated-by-Comma)/((9223372036854775807/Cluster-Node-Count) + 1)) AS Node, COUNT(*) AS row_count FROM Table-Schema.Table-Name GROUP BY 1 ORDER BY 1;
Example:
dbadmin=> SELECT FLOOR(HASH(DateKey, AccountKey)/((9223372036854775807/3) + 1)) AS Node, dbadmin-> COUNT(*) AS row_count dbadmin-> FROM public.BankAcct_Fact dbadmin-> GROUP BY 1 dbadmin-> ORDER BY 1; Node | row_count ------+----------- 0 | 135143741 1 | 135133727 2 | 135187445 (3 rows)
Because the row counts are evenly distributed across the cluster, the DateKey and AccountKey columns hashed together make a great candidate segmentation key.
Have fun!