The Effectiveness of Lloyd-Type Methods for the k-Means Problem

We investigate variants of Lloyd's heuristic for clustering high-dimensional data, both to explain its continued popularity among practitioners (half a century after its introduction) and to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd's heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd's heuristic. The guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd's method, and our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
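To make the idea of a seeded Lloyd-type iteration concrete, here is a minimal sketch in plain Python. The abstract does not spell out the seeding procedure, so this sketch substitutes distance-squared-weighted sampling (the k-means++ style of seeding) as a stand-in; it should not be read as the paper's exact algorithm. The function names `seed_centers`, `lloyd`, and `cost`, and the parameters `iters` and `seed`, are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seed_centers(points, k, rng):
    """Pick k starting centers by D^2-weighted sampling.

    Assumption: this is a k-means++-style stand-in, not the seeding
    process proposed in the paper.
    """
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        d2 = [min(d, sq_dist(p, centers[-1])) for p, d in zip(points, d2)]
    return centers


def lloyd(points, k, iters=20, seed=0):
    """Run Lloyd's iteration from probabilistically seeded centers."""
    rng = random.Random(seed)
    centers = seed_centers(points, k, rng)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: move each center to the centroid of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    # Two well-separated groups of four points each.
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11)]
    found = lloyd(pts, k=2)
    print(sorted(found), cost(pts, found))
```

On a well-separated instance such as the toy data in the usage block, the seeding is very likely to place one starting center in each group, after which the assignment and update steps settle within a few rounds. This is only an informal illustration of why careful seeding matters; the paper's guarantees rest on its own clusterability criterion and analysis.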
