Finding the "K" in K-means clustering with a UDx
You can apply k-means clustering to partition data points into k different groups. Along with the data, the number of clusters "k" is an input to the algorithm. Common examples like the Iris data set tell you up front how many different groups exist, so you set k=3. What if you don't know how many clusters to expect in your data set?
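To make the role of "k" concrete, here is a minimal Python sketch of the usual k-means loop (Lloyd's algorithm). It is not the code used by the UDx. The function name, the toy points, and the choice to seed the centroids with the first k points are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Partition points into k groups with a toy version of Lloyd's algorithm."""
    # Assumption: seed the centroids with the first k points so the run is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The algorithm never questions the value of k it is given: calling it with a different k on the same points simply returns a different partition, which is why an estimate of "k" is useful before clustering.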
There are several approaches to estimating "k". Cebeci and Cebeci combined several methods into the R package "kpeaks", which we'll use here to predict "k".
The UDx is built on an R library, so first install the Vertica R package from your usual source.
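Next, install the jsonlite and kpeaks packages into the Vertica R installation, either as shown at https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/ExtendingVertica/R/RPackages.htm or by starting the Vertica R interpreter:
$ sudo /opt/vertica/R/bin/R
From the R prompt, install both packages, for example with install.packages(c("jsonlite", "kpeaks")). You'll likely need to select a CRAN mirror from the download list.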
Download the attached text files, the R file "kpeaks.R" and the SQL file "kpeaks_test.sql", rename them with the correct extension (.R or .sql), and copy them to a cluster node. Run kpeaks_test.sql with vsql (for example, vsql -f kpeaks_test.sql), which does the following:
- Loads the R library
- Defines the kpeaks function
- Loads the Iris clustering data set
- Runs "kpeaks" on the Iris data set
You should get a JSON output like the following:
So the methods implemented by kpeaks suggest that there are between 1 and 3 clusters in the data set. This should help reduce the number of trials needed to identify the best "k" for a data set. Check out the references below for a better understanding of what the kpeaks results mean.
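Once the candidates are narrowed to a short range, the remaining trials are cheap. The Python sketch below runs a toy one-dimensional k-means for each candidate and compares the within-cluster sum of squares (WCSS). Using WCSS as the comparison metric is an assumption here, since kpeaks itself does not prescribe how to pick among its suggestions. The function name, the seeding rule, and the data are hypothetical.

```python
def wcss_for_k(values, k, iterations=20):
    """Run a toy 1-D k-means and return the within-cluster sum of squares."""
    ordered = sorted(values)
    # Assumption: seed the centroids at evenly spaced positions in the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Update step: each centroid becomes the mean of its cluster.
        centroids = [
            sum(members) / len(members) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
    # Sum of squared distances from each value to its cluster's centroid.
    return sum(
        (v - centroids[c]) ** 2
        for c, members in enumerate(clusters)
        for v in members
    )

# Hypothetical data with three obvious groups.
values = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 9.0, 9.1, 8.9]
candidate_ks = [1, 2, 3]  # hypothetical shortlist, standing in for the suggested range
wcss = {k: wcss_for_k(values, k) for k in candidate_ks}
for k in candidate_ks:
    print(k, round(wcss[k], 2))
```

WCSS always shrinks as k grows, so the value to look for is the k where the improvement levels off rather than the smallest number. On this toy data the WCSS is close to zero at k=3, which matches the three groups built into it.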
For more information and the math behind the library:
- kpeaks documentation at https://cran.r-project.org/web/packages/kpeaks/kpeaks.pdf
- kpeaks publication: Cebeci Z and Cebeci C, "kpeaks: An R Package for Quick Selection of K for Cluster Analysis", https://www.researchgate.net/publication/331258718_kpeaks_An_R_Package_for_Quick_Selection_of_K_for_Cluster_Analysis