Paper information - Using the Triangle Inequality to Accelerate k-Means

Using the Triangle Inequality to Accelerate k-Means

The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically, while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and becomes more and more effective as the number k of clusters increases. For k ≥ 20 it is many times faster than the best previously known accelerated k-means method.
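The abstract does not spell out how the triangle inequality removes distance calculations, so the following is only a minimal sketch of one such pruning test, assuming Euclidean distance. It relies on a standard consequence of the triangle inequality: if d(b, c) ≥ 2·d(x, b) for a point x and centers b and c, then d(x, c) ≥ d(x, b), so d(x, c) never needs to be computed. The function names (`dist`, `assign_brute_force`, `assign_pruned`) are hypothetical, and the code covers a single assignment step only. It does not maintain the per-point lower and upper bounds across iterations that the abstract describes.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_brute_force(points, centers):
    """Standard assignment step: compare every point with every center.

    Returns (labels, number of distance computations).
    """
    k = len(centers)
    labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
    return labels, len(points) * k


def assign_pruned(points, centers):
    """Assignment step that skips centers ruled out by the triangle inequality.

    Illustrative assumption, not the paper's full method: if
    d(c_best, c_j) >= 2 * d(p, c_best), then d(p, c_j) >= d(p, c_best),
    so c_j cannot be strictly closer and d(p, c_j) is not computed.
    Returns (labels, number of distance computations).
    """
    k = len(centers)
    # Center-to-center distances: k*(k-1)/2 computations shared by all points.
    cc = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            cc[i][j] = cc[j][i] = dist(centers[i], centers[j])
    computed = k * (k - 1) // 2

    labels = []
    for p in points:
        best, best_d = 0, dist(p, centers[0])
        computed += 1
        for j in range(1, k):
            if cc[best][j] >= 2.0 * best_d:
                continue  # pruned: center j cannot beat the current best
            d = dist(p, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed
```

Both functions return the same assignment. The pruned version pays a one-time cost for the center-to-center distances, which is small when the number of points is much larger than k, and it saves the most work when centers are far apart relative to the distance from each point to its nearest center.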

Charles Elkan
