Train a K-means model on the given set of points; data should be cached for high performance, because this is an iterative algorithm.
Set the distance threshold within which we consider centers to have converged. If all centers move less than this Euclidean distance, we stop iterating that run.
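The convergence rule described above can be sketched in a few lines of plain Python. This is only an illustration of the stated rule, not the library's implementation; the function name `has_converged` and the parameter name `epsilon` are hypothetical.

```python
import math

def has_converged(old_centers, new_centers, epsilon):
    """Return True when every center moved less than epsilon (Euclidean distance)."""
    return all(
        math.dist(old, new) < epsilon
        for old, new in zip(old_centers, new_centers)
    )

old = [(0.0, 0.0), (5.0, 5.0)]
small_move = [(0.0, 0.00001), (5.0, 5.0)]  # both centers moved less than 1e-4
large_move = [(0.0, 0.00001), (5.0, 6.0)]  # second center moved a full unit

print(has_converged(old, small_move, epsilon=1e-4))  # True
print(has_converged(old, large_move, epsilon=1e-4))  # False
```

Note that a single center moving farther than the threshold is enough to keep the run iterating.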
Set the initialization algorithm. This can be either "random" to choose random points as initial cluster centers, or "k-means||" to use a parallel variant of k-means++ (Bahmani et al., Scalable K-Means++, VLDB 2012). Default: k-means||.
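To make the difference between the two modes concrete, here is a minimal plain-Python sketch. It shows uniform random seeding next to the sequential k-means++ seeding rule, which k-means|| parallelizes; it does not implement k-means|| itself. The function names are hypothetical, and sampling without replacement in the "random" case is an assumption.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def random_init(points, k, rng):
    """Uniform seeding: pick k distinct points at random (assumed behaviour)."""
    return rng.sample(points, k)

def kmeans_pp_init(points, k, rng):
    """Sequential k-means++ seeding (not the parallel k-means|| variant).

    Each new center is drawn with probability proportional to the squared
    distance from the point to its nearest already-chosen center.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

rng = random.Random(42)
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0), (9.1, 0.0)]
print(random_init(points, 3, rng))
print(kmeans_pp_init(points, 3, rng))
```

Because far-away points receive larger weights, the k-means++ rule tends to spread the initial centers across separate clusters, whereas uniform seeding can place several centers in the same cluster.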
Set the number of steps for the k-means|| initialization mode. This is an advanced setting -- the default of 5 is almost always enough. Default: 5.
Set the number of clusters to create (k). Default: 2.
Set the maximum number of iterations to run. Default: 20.
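How the iteration cap interacts with the convergence threshold can be illustrated with a small single-machine sketch of the standard Lloyd iteration. This is an assumption about the general shape of the loop, not the library's distributed code; `lloyd`, `nearest`, and `mean` are hypothetical names, and keeping the old center for an empty cluster is a simplification chosen here.

```python
import math

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lloyd(points, initial_centers, epsilon, max_iterations=20):
    """Run Lloyd iterations until centers move less than epsilon or the cap is hit."""
    centers = list(initial_centers)
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        # Assignment step: group points by nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center in this sketch).
        new_centers = [mean(c) if c else old for c, old in zip(clusters, centers)]
        moved = max(math.dist(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if moved < epsilon:
            break
    return centers, iterations

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, iterations = lloyd(points, [(0, 0), (10, 10)], epsilon=1e-4)
print(centers, iterations)  # [(0.0, 0.5), (10.0, 10.5)] 2
```

In this example the centers settle after the first update, so the second iteration detects zero movement and stops well before the cap of 20.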
Set the number of runs of the algorithm to execute in parallel. We initialize the algorithm this many times with random starting conditions (configured by the initialization mode), then return the best clustering found over any run. Default: 1.
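The "best of several runs" idea can be sketched as follows. The text does not say how "best" is measured; this sketch assumes the usual k-means cost, the sum of squared distances from each point to its nearest center. The runs are executed one after another here rather than in parallel, and `best_of_runs`, `cost`, and the stand-in `random_centers` are hypothetical names.

```python
import random

def cost(points, centers):
    """Sum of squared distances from each point to its nearest center (assumed metric)."""
    return sum(
        min(sum((x - y) ** 2 for x, y in zip(p, c)) for c in centers)
        for p in points
    )

def best_of_runs(points, run_once, runs=1, seed=0):
    """Call run_once(points, rng) `runs` times and keep the lowest-cost result."""
    rng = random.Random(seed)
    candidates = [run_once(points, rng) for _ in range(runs)]
    return min(candidates, key=lambda centers: cost(points, centers))

def random_centers(points, rng, k=2):
    # Stand-in for a full k-means run: just picks k random points as centers.
    return rng.sample(points, k)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
single = best_of_runs(points, random_centers, runs=1)
several = best_of_runs(points, random_centers, runs=10)
print(cost(points, single), cost(points, several))
```

With the same seed, the first candidate is identical in both calls, so the cost from ten runs can never be worse than the cost from one.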
K-means clustering with support for multiple parallel runs and a k-means++-like initialization mode (the k-means|| algorithm by Bahmani et al.). When multiple concurrent runs are requested, they are executed together with joint passes over the data for efficiency.
This is an iterative algorithm that will make multiple passes over the data, so any RDDs given to it should be cached by the user.