k-means++: the advantages of careful seeding

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
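The abstract does not spell out the seeding rule. The sketch below assumes the commonly described D² weighting: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The names `sq_dist` and `kmeanspp_seed` are hypothetical, and the uniform fallback for the degenerate all-zero-weight case is an assumption of this sketch, not something stated in the text.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers from `points` using D^2-weighted sampling.

    Illustrative sketch only (assumed rule, hypothetical names).
    """
    rng = rng or random.Random()
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # First center: uniform over the data points.
    centers = [points[rng.randrange(len(points))]]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: every point coincides with a chosen center,
            # so fall back to a uniform draw.
            idx = rng.randrange(len(points))
        else:
            # Later centers: probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # Update nearest-center distances with the new center.
        d2 = [min(old, sq_dist(p, centers[-1])) for old, p in zip(d2, points)]

    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
    print(kmeanspp_seed(data, 2, random.Random(0)))
```

The returned centers would then be handed to the usual k-means iterations as the starting configuration. Because an already chosen point has weight zero, distinct data points are never selected twice.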
