Paper information - Accelerating exact k-means algorithms with geometric reasoning


Abstract: We present new algorithms for the k-means clustering problem. They use the kd-tree data structure to reduce the large number of nearest-neighbor queries issued by the traditional algorithm. Sufficient statistics are stored in the nodes of the kd-tree. An analysis of the geometry of the current cluster centers then results in a great reduction of the work needed to update the centers. Our algorithms behave exactly as the traditional k-means algorithm does. Proofs of correctness are included. The kd-tree can also be used to initialize the k-means starting centers efficiently. Our algorithms can be easily extended to provide fast ways of computing the error of a given cluster assignment, regardless of the method by which those clusters were obtained. We also show how to use them in a setting that allows approximate clustering results, with the benefit of running faster. We have implemented and tested our algorithms on both real and simulated data. Results show a speedup factor of up to 170 on real astrophysical data, and superiority over the naive algorithm on simulated data in up to 5 dimensions. Our algorithms scale well with respect to the number of points and the number of centers, allowing for clustering with tens of thousands of centers.
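The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the names `Node`, `dominates`, `accumulate`, and `kmeans_step` are hypothetical, and the pruning rule shown (a bounding-box corner test) is one assumed way to do the "analysis of the geometry of the current cluster centers". Each kd-tree node caches a point count and a vector sum. When one center is provably closer than every other center to the whole bounding box of a node, the node's cached statistics are credited to that center without visiting its points.

```python
import numpy as np

class Node:
    """kd-tree node caching sufficient statistics (count, vector sum)."""
    def __init__(self, pts, leaf_size=8):
        self.lo, self.hi = pts.min(axis=0), pts.max(axis=0)  # bounding box
        self.count, self.vsum = len(pts), pts.sum(axis=0)    # cached statistics
        self.pts, self.kids = pts, []
        dim = int(np.argmax(self.hi - self.lo))              # widest dimension
        if len(pts) > leaf_size and self.hi[dim] > self.lo[dim]:
            order = np.argsort(pts[:, dim])
            mid = len(pts) // 2                              # median split
            self.kids = [Node(pts[order[:mid]], leaf_size),
                         Node(pts[order[mid:]], leaf_size)]

def dominates(c_star, c, lo, hi):
    """True if every point of the box [lo, hi] is strictly closer to c_star than to c."""
    p = np.where(c > c_star, hi, lo)  # box corner most favourable to c
    return np.sum((p - c_star) ** 2) < np.sum((p - c) ** 2)

def accumulate(node, centers, cand, sums, counts):
    """Add the points under `node` to the per-center sums and counts."""
    # Candidate owner: the center nearest to the node's bounding box.
    clipped = np.clip(centers[cand], node.lo, node.hi)
    star = cand[int(np.argmin(np.sum((centers[cand] - clipped) ** 2, axis=1)))]
    # Drop every center that the candidate dominates over the whole box.
    keep = [c for c in cand
            if c == star or not dominates(centers[star], centers[c], node.lo, node.hi)]
    if len(keep) == 1:
        # One center owns the whole node: use the cached statistics.
        sums[star] += node.vsum
        counts[star] += node.count
    elif node.kids:
        for kid in node.kids:
            accumulate(kid, centers, keep, sums, counts)
    else:
        # Leaf with several surviving centers: fall back to direct assignment.
        d2 = ((node.pts[:, None, :] - centers[keep][None, :, :]) ** 2).sum(axis=2)
        owner = np.asarray(keep)[np.argmin(d2, axis=1)]
        np.add.at(sums, owner, node.pts)
        np.add.at(counts, owner, 1)

def kmeans_step(root, centers):
    """One exact k-means center update driven by the kd-tree."""
    k, d = centers.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    accumulate(root, centers, list(range(k)), sums, counts)
    new = centers.copy()
    nz = counts > 0                   # centers with no points stay where they are
    new[nz] = sums[nz] / counts[nz, None]
    return new

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.random((2000, 2))
    centers = pts[rng.choice(len(pts), 5, replace=False)]
    root = Node(pts)                  # built once, reused in every iteration
    for _ in range(10):
        centers = kmeans_step(root, centers)
    print(centers)
```

In this sketch the tree is built once and reused across iterations, so only the center geometry changes between updates. Because pruned centers are strictly farther from every point in the box, each update should match the plain assign-and-average step up to floating-point rounding.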

Andrew W. Moore | Dan Pelleg
