Paper Information - An Efficient k-Means Clustering Algorithm: Analysis and Implementation

An Efficient k-Means Clustering Algorithm: Analysis and Implementation

In k-means clustering, we are given a set of n data points in d-dimensional space R^d and an integer k, and the problem is to determine a set of k points in R^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. A popular heuristic for k-means clustering is Lloyd's (1982) algorithm. We present a simple and efficient implementation of Lloyd's k-means clustering algorithm, which we call the filtering algorithm. This algorithm is easy to implement, requiring a kd-tree as the only major data structure. We establish the practical efficiency of the filtering algorithm in two ways. First, we present a data-sensitive analysis of the algorithm's running time, which shows that the algorithm runs faster as the separation between clusters increases. Second, we present a number of empirical studies both on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation.
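
For orientation, the objective and the basic Lloyd iteration described in the abstract can be sketched as follows. This is a plain, brute-force version written for illustration only: it is not the paper's filtering algorithm, which uses a kd-tree to avoid comparing every point against every center. The function names, the random initialization, and the handling of empty clusters are all assumptions of this sketch.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_squared_distance(points, centers):
    """The k-means objective: mean squared distance to the nearest center."""
    total = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return total / len(points)


def lloyd(points, k, max_iterations=100, seed=0):
    """Brute-force Lloyd iteration (hypothetical helper, not the filtering algorithm)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centers = rng.sample(points, k)
    dim = len(points[0])
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the centroid of its points.
        new_centers = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroid = tuple(
                    sum(p[i] for p in members) / len(members) for i in range(dim)
                )
                new_centers.append(centroid)
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # no center moved, so the assignment is stable
        centers = new_centers
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    found = lloyd(data, k=2)
    print(sorted(found), mean_squared_distance(data, found))
```

Each iteration here costs on the order of n·k distance computations; the abstract's claim is that a kd-tree over the data points lets many of those comparisons be skipped, and more so when the clusters are well separated.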

[1] William Feller,et al. An Introduction To Probability Theory And Its Applications , 1950 .

[2] E. Forgy,et al. Cluster analysis of multivariate data : efficiency versus interpretability of classifications , 1965 .

[3] J. MacQueen. Some methods for classification and analysis of multivariate observations , 1967 .

[4] J. MacQueen. Some methods for classification and analysis of multivariate observations , 1967 .

[5] Richard O. Duda,et al. Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[6] Peter E. Hart,et al. Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[7] Jon Louis Bentley,et al. Multidimensional binary search trees used for associative searching , 1975, CACM.

[8] M. V. Bhat,et al. An Efficient Clustering Algorithm , 1976, IEEE Transactions on Systems, Man, and Cybernetics.

[9] Jon Louis Bentley,et al. An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1977, TOMS.

[10] David S. Johnson,et al. Computers and Intractability: A Guide to the Theory of NP-Completeness , 1978 .

[11] S. P. Lloyd,et al. Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[12] D. Pollard. A Central Limit Theorem for $k$-Means Clustering , 1982 .

[13] Kenneth L. Clarkson,et al. Fast algorithms for the all nearest neighbors problem , 1983, 24th Annual Symposium on Foundations of Computer Science (sfcs 1983).

[14] Shokri Z. Selim,et al. K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] Anil K. Jain,et al. A spatial filtering approach to texture analysis , 1985, Pattern Recognit. Lett..

[16] Michael Ian Shamos,et al. Computational geometry: an introduction , 1985 .

[17] Teuvo Kohonen,et al. Self-Organization and Associative Memory , 1988 .

[18] Anil K. Jain,et al. Algorithms for Clustering Data , 1988 .

[19] Teuvo Kohonen,et al. Self-organization and associative memory: 3rd edition , 1989 .

[20] Hanan Samet,et al. The Design and Analysis of Spatial Data Structures , 1989 .

[21] Pravin M. Vaidya,et al. An O(n log n) algorithm for the all-nearest-neighbors problem , 1989, Discret. Comput. Geom..

[22] Keinosuke Fukunaga,et al. Introduction to statistical pattern recognition (2nd ed.) , 1990 .

[23] Peter J. Rousseeuw,et al. Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[24] Allen Gersho,et al. Vector quantization and signal compression , 1991, The Kluwer international series in engineering and computer science.

[25] Gerhard J. Woeginger,et al. Geometric Clusterings , 1991, J. Algorithms.

[26] Marshall W. Bern,et al. Approximate Closest-Point Queries in High Dimensions , 1993, Inf. Process. Lett..

[27] Yoshua Bengio,et al. Convergence Properties of the K-Means Algorithms , 1994, NIPS.

[28] Mary Inaba,et al. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract) , 1994, SCG '94.

[29] Jiawei Han,et al. Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[30] Vance Faber,et al. Clustering and the continuous k-means algorithm , 1994 .

[31] M. Inaba. Application of weighted Voronoi diagrams and randomization to variance-based k-clustering , 1994, SoCG 1994.

[32] S. Rao Kosaraju,et al. A decomposition of multidimensional point sets with applications to k-nearest-neighbors and n-body potential fields , 1995, JACM.

[33] Hans-Peter Kriegel,et al. A Database Interface for Clustering in Large Spatial Databases , 1995, KDD.

[34] Hiroshi Imai,et al. Experimental results of randomized clustering algorithm , 1996, SCG '96.

[35] Sanjay Ranka,et al. An efficient k-means clustering algorithm , 1997 .

[36] Sunil Arya,et al. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions , 1998, JACM.

[37] Sunil Arya,et al. ANN: library for approximate nearest neighbor searching , 1998 .

[38] Paul S. Bradley,et al. Scaling Clustering Algorithms to Large Databases , 1998, KDD.

[39] Paul S. Bradley,et al. Refining Initial Points for K-Means Clustering , 1998, ICML.

[40] Satish Rao,et al. Approximation schemes for Euclidean k-medians and related problems , 1998, STOC '98.

[41] Pankaj K. Agarwal,et al. Exact and Approximation Algorithms for Clustering , 1997 .

[42] Andrew W. Moore,et al. Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees , 1998, NIPS.

[43] Andrew W. Moore,et al. Accelerating exact k-means algorithms with geometric reasoning , 1999, KDD '99.

[44] Michael T. Goodrich,et al. Balanced aspect ratio trees: combining the advantages of k-d trees and octrees , 1999, SODA '99.

[45] David M. Mount,et al. Analysis of approximate nearest neighbor searching with clustered point sets , 1999, Data Structures, Near Neighbor Searches, and Methodology.

[46] David M. Mount,et al. Computing nearest neighbors for moving points and applications to clustering , 1999, SODA '99.

[47] Satish Rao,et al. A Nearly Linear-Time Approximation Scheme for the Euclidean k-median Problem , 1999, ESA.

[48] J. Matoušek. On Approximate Geometric K-clustering , 1999 .

[49] Anil K. Jain,et al. Data clustering: a review , 1999, CSUR.

[50] Sanjoy Dasgupta,et al. Learning mixtures of Gaussians , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[51] Qiang Du,et al. Centroidal Voronoi Tessellations: Applications and Algorithms , 1999, SIAM Rev..

[52] Sunil Arya,et al. Approximate range searching , 2000, Comput. Geom..

[53] Anil K. Jain,et al. Statistical Pattern Recognition: A Review , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[54] Sanjoy Dasgupta,et al. A Two-Round Variant of EM for Gaussian Mixtures , 2000, UAI.

[55] Jiří Matoušek,et al. On Approximate Geometric k-Clustering , 2000, Discret. Comput. Geom..

[57] David M. Mount,et al. The analysis of a simple k-means clustering algorithm , 2000, SCG '00.

[58] Andrew W. Moore,et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters , 2000, ICML.

[59] Tian Zhang,et al. BIRCH: A New Data Clustering Algorithm and Its Applications , 1997, Data Mining and Knowledge Discovery.

[60] Gregory Piatetsky-Shapiro,et al. Advances in Knowledge Discovery and Data Mining , 2004, Lecture Notes in Computer Science.

[61] Olvi L. Mangasarian,et al. Mathematical Programming in Data Mining , 1997, Data Mining and Knowledge Discovery.