Finding the "K" in K-means clustering with a UDx

Bryan_H · October 2019

You can apply k-means clustering to partition data points into k different groups. Along with the data, the number of clusters "k" is an input to the algorithm. Common examples like the Iris data set tell you up front how many different groups exist, so you set k=3. What if you don't know how many clusters to expect in your data set?
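To make the role of "k" concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on one-dimensional data. It is not Vertica's implementation: the function name kmeans_1d, the evenly spaced starting centroids, and the nine sample points are all made up for illustration.

```python
def kmeans_1d(points, k, max_iter=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, wcss)."""
    n = len(points)
    # Assumption: deterministic start using k points spread through the input.
    centroids = [points[i * n // k] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its members.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Within-cluster sum of squares: squared distance to the nearest centroid.
    wcss = sum(min((x - c) ** 2 for c in centroids) for x in points)
    return centroids, wcss


# Hypothetical sample data: three tight groups around 1, 5 and 9.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]

# k is supplied by the caller; the algorithm never chooses it.
wcss_by_k = {k: kmeans_1d(points, k)[1] for k in (1, 2, 3)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 2))
```

On this made-up data the within-cluster sum of squares falls from about 96.24 at k=1 to 24.24 at k=2 and 0.24 at k=3. Adding clusters tends to keep lowering that number, so the algorithm alone can't tell you which k is right.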

There are several approaches to estimating "k". Cebeci and Cebeci combined several of these methods into the R package "kpeaks", which we'll use here to predict "k".

The UDx is built on an R library, so first install the Vertica R package from your usual source.

Next, install the jsonlite and kpeaks packages into the Vertica R installation, either as shown at https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ExtendingVertica/R/RPackages.htm or by starting R and running the install commands at its prompt:

$ sudo /opt/vertica/R/bin/R

install.packages("jsonlite");
install.packages("kpeaks");

You'll likely need to select a CRAN mirror from the download list.

Download the attached text files for the R file "kpeaks.R" and the SQL file "kpeaks_test.sql", rename them with the correct extension (.R or .sql), and copy them to a cluster node. Run kpeaks_test.sql with vsql, which does the following:

-- Loads the R library
-- Defines the kpeaks function
-- Loads the Iris clustering data set
-- Runs "kpeaks" on the Iris data set

You should get JSON output like the following:

KPeaks_User

{"am":[2],"med":[2],"mod":[2],"mppc":[2],"cr":[2],"ciqr":[2],"mq3m":[3],"mtl":[2],"avgk":[2],"modk":[2],"mtlk":[2],"dst":["Full"],"pcounts":[2,1,2,3]}

So the methods implemented by kpeaks suggest there are 1-3 clusters in the data set. This should help reduce the number of trials needed to identify the best "k" for a data set. Check out the references below for a better understanding of what the kpeaks results mean.
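If you want to work with the result programmatically, a short script can pull the suggested values out of the JSON. The sketch below is Python and uses the sample output above verbatim. Treating "pcounts" as a list of counts and "dst" as a label rather than an estimate is my reading of the output, so check it against the kpeaks documentation.

```python
import json
from collections import Counter

# Sample output copied from the kpeaks_test.sql run shown above.
raw = ('{"am":[2],"med":[2],"mod":[2],"mppc":[2],"cr":[2],"ciqr":[2],'
       '"mq3m":[3],"mtl":[2],"avgk":[2],"modk":[2],"mtlk":[2],'
       '"dst":["Full"],"pcounts":[2,1,2,3]}')
result = json.loads(raw)

# Assumption: "dst" is a descriptive label and "pcounts" is a list of counts;
# every other key holds a single suggested value for k.
pcounts = result["pcounts"]
estimates = {name: values[0] for name, values in result.items()
             if name not in ("pcounts", "dst")}

# Every distinct value worth trying as k, smallest first.
candidates = sorted(set(estimates.values()) | set(pcounts))

# How many of the single-valued estimates agree on each k.
votes = Counter(estimates.values())

print("candidate k values:", candidates)
print("most common estimate:", votes.most_common(1)[0])
```

For the Iris output this gives candidates 1, 2 and 3, with 10 of the 11 single-valued estimates agreeing on 2, which matches the 1-3 range noted above.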

For more information and the math behind the library:
-- kpeaks documentation at https://cran.r-project.org/web/packages/kpeaks/kpeaks.pdf
-- kpeaks publication: Cebeci Z and Cebeci C, "kpeaks: An R Package for Quick Selection of K for Cluster Analysis", https://www.researchgate.net/publication/331258718_kpeaks_An_R_Package_for_Quick_Selection_of_K_for_Cluster_Analysis

Enjoy!