There has been considerable work on improving the popular clustering algorithm k-means in terms of both mean squared error (MSE) and speed. However, most k-means variants still compute the distance from each data point to every cluster centroid in every iteration. We propose a fast heuristic that removes this bottleneck with only a marginal increase in MSE. We observe that across the iterations of k-means, a data point changes its membership only among a small subset of clusters. Our heuristic predicts this subset for each data point by examining its nearby clusters after the first iteration of k-means. We augment well-known variants of k-means with our heuristic to demonstrate its effectiveness. On various synthetic and real-world datasets, our heuristic achieves speedups of up to 3x compared to efficient variants of k-means.
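The idea above can be sketched in a few lines of NumPy: run one full assignment pass, record each point's t nearest centroids as its candidate set, and restrict all subsequent distance computations to those candidates. This is a minimal illustrative sketch, not the paper's implementation; the candidate-set size `t` and the "t nearest centroids after iteration one" selection rule are assumptions made for illustration.

```python
import numpy as np

def kmeans_with_candidates(X, k, t=3, n_iters=20, seed=0):
    """K-means sketch: after one full iteration, each point only compares
    itself against its t nearest centroids from that iteration.
    (t and the candidate rule are illustrative assumptions.)"""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    # Iteration 1: full n-by-k distance computation.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Candidate clusters per point: indices of the t nearest centroids.
    candidates = np.argsort(d, axis=1)[:, :t]

    for _ in range(n_iters - 1):
        # Update step: recompute centroids (keep old one if cluster empties).
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
        # Assignment step restricted to each point's candidate set:
        # only n-by-t distances instead of n-by-k.
        cand_cent = centroids[candidates]                        # (n, t, dim)
        dc = np.linalg.norm(X[:, None, :] - cand_cent, axis=2)   # (n, t)
        labels = candidates[np.arange(len(X)), dc.argmin(axis=1)]
    return labels, centroids
```

With t much smaller than k, the per-iteration assignment cost drops from O(nk) to O(nt) distance evaluations, which is where the reported speedup comes from; the risk, and the source of the marginal MSE increase, is that a point's true nearest centroid may drift outside its candidate set in later iterations.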