An Efficient k-Means Clustering Algorithm: Analysis and Implementation

In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies, both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
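The objective and the Lloyd iteration described above can be sketched as follows. This is a minimal brute-force illustration, not the filtering algorithm: it compares every point against every center and uses no kd-tree. The function names (`squared_distance`, `lloyd_kmeans`, `mean_squared_distance`) and the choice to pass the initial centers explicitly are assumptions made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, initial_centers, max_iters=100):
    """Run plain Lloyd iterations from the given initial centers.

    Hypothetical helper for illustration: each iteration costs O(n * k * d),
    because every point is compared against every center.
    """
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the centroid of its cluster.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                centroid = tuple(
                    sum(m[i] for m in members) / len(members) for i in range(dim)
                )
                new_centers.append(centroid)
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # No center moved, so the assignment is stable.
        centers = new_centers
    return centers


def mean_squared_distance(points, centers):
    """The k-means objective: mean squared distance to the nearest center."""
    total = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return total / len(points)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    result = lloyd_kmeans(data, initial_centers=[(0, 0), (10, 10)])
    print(result, mean_squared_distance(data, result))
```

Because Lloyd's algorithm is a heuristic, the result depends on the initial centers: on the same data, starting from (0, 1) and (1, 0) stalls at a much worse local minimum.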

[1]  William Feller,et al.  An Introduction To Probability Theory And Its Applications , 1950 .

[2]  E. Forgy,et al.  Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[3]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[4]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[5]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[6]  Peter E. Hart,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[7]  Jon Louis Bentley,et al.  Multidimensional binary search trees used for associative searching , 1975, CACM.

[8]  M. V. Bhat,et al.  An Efficient Clustering Algorithm , 1976, IEEE Transactions on Systems, Man, and Cybernetics.

[9]  Jon Louis Bentley,et al.  An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1977, TOMS.

[10]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[11]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[12]  D. Pollard A Central Limit Theorem for $k$-Means Clustering , 1982 .

[13]  Kenneth L. Clarkson,et al.  Fast algorithms for the all nearest neighbors problem , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[14]  Shokri Z. Selim,et al.  K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Anil K. Jain,et al.  A spatial filtering approach to texture analysis , 1985, Pattern Recognit. Lett..

[16]  Michael Ian Shamos,et al.  Computational geometry: an introduction , 1985 .

[17]  Teuvo Kohonen,et al.  Self-Organization and Associative Memory , 1988 .

[18]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[19]  Teuvo Kohonen,et al.  Self-organization and associative memory: 3rd edition , 1989 .

[20]  Hanan Samet,et al.  The Design and Analysis of Spatial Data Structures , 1989 .

[21]  Pravin M. Vaidya,et al.  An O(n log n) algorithm for the all-nearest-neighbors problem , 1989, Discret. Comput. Geom..

[22]  Keinosuke Fukunaga,et al.  Introduction to statistical pattern recognition (2nd ed.) , 1990 .

[23]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[24]  Allen Gersho,et al.  Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[25]  Gerhard J. Woeginger,et al.  Geometric Clusterings , 1991, J. Algorithms.

[26]  Marshall W. Bern,et al.  Approximate Closest-Point Queries in High Dimensions , 1993, Inf. Process. Lett..

[27]  Yoshua Bengio,et al.  Convergence Properties of the K-Means Algorithms , 1994, NIPS.

[28]  Mary Inaba,et al.  Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract) , 1994, SCG '94.

[29]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[30]  Vance Faber,et al.  Clustering and the continuous k-means algorithm , 1994 .

[31]  M. Inaba Application of weighted Voronoi diagrams and randomization to variance-based k-clustering , 1994, SoCG 1994.

[32]  S. Rao Kosaraju,et al.  A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields , 1995, JACM.

[33]  Hans-Peter Kriegel,et al.  A Database Interface for Clustering in Large Spatial Databases , 1995, KDD.

[34]  Hiroshi Imai,et al.  Experimental results of randomized clustering algorithm , 1996, SCG '96.

[35]  Sanjay Ranka,et al.  An efficient k-means clustering algorithm , 1997 .

[36]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching in fixed dimensions , 1998, JACM.

[37]  Sunil Arya,et al.  ANN: library for approximate nearest neighbor searching , 1998 .

[38]  Paul S. Bradley,et al.  Scaling Clustering Algorithms to Large Databases , 1998, KDD.

[39]  Paul S. Bradley,et al.  Refining Initial Points for K-Means Clustering , 1998, ICML.

[40]  Satish Rao,et al.  Approximation schemes for Euclidean k-medians and related problems , 1998, STOC '98.

[41]  Pankaj K. Agarwal,et al.  Exact and Approximation Algorithms for Clustering , 1997 .

[42]  Andrew W. Moore,et al.  Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees , 1998, NIPS.

[43]  Andrew W. Moore,et al.  Accelerating exact k-means algorithms with geometric reasoning , 1999, KDD '99.

[44]  Michael T. Goodrich,et al.  Balanced aspect ratio trees: combining the advantages of k-d trees and octrees , 1999, SODA '99.

[45]  David M. Mount,et al.  Analysis of approximate nearest neighbor searching with clustered point sets , 1999, Data Structures, Near Neighbor Searches, and Methodology.

[46]  David M. Mount,et al.  Computing nearest neighbors for moving points and applications to clustering , 1999, SODA '99.

[47]  Satish Rao,et al.  A Nearly Linear-Time Approximation Scheme for the Euclidean k-median Problem , 1999, ESA.

[48]  J. Matoušek On Approximate Geometric K-clustering , 1999 .

[49]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[50]  Sanjoy Dasgupta,et al.  Learning mixtures of Gaussians , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[51]  Qiang Du,et al.  Centroidal Voronoi Tessellations: Applications and Algorithms , 1999, SIAM Rev..

[52]  Sunil Arya,et al.  Approximate range searching , 2000, Comput. Geom..

[53]  Anil K. Jain,et al.  Statistical Pattern Recognition: A Review , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[54]  Sanjoy Dasgupta,et al.  A Two-Round Variant of EM for Gaussian Mixtures , 2000, UAI.

[55]  Jiří Matoušek,et al.  On Approximate Geometric k-Clustering , 2000, Discret. Comput. Geom..

[56]  Thomas de Quincey [C] , 2000, The Works of Thomas De Quincey, Vol. 1: Writings, 1799–1820.

[57]  David M. Mount,et al.  The analysis of a simple k-means clustering algorithm , 2000, SCG '00.

[58]  Andrew W. Moore,et al.  X-means: Extending K-means with Efficient Estimation of the Number of Clusters , 2000, ICML.

[59]  Tian Zhang,et al.  BIRCH: A New Data Clustering Algorithm and Its Applications , 1997, Data Mining and Knowledge Discovery.

[60]  Gregory Piatetsky-Shapiro,et al.  Advances in Knowledge Discovery and Data Mining , 2004, Lecture Notes in Computer Science.

[61]  Olvi L. Mangasarian,et al.  Mathematical Programming in Data Mining , 1997, Data Mining and Knowledge Discovery.