Using the Triangle Inequality to Accelerate k-Means

The k-means algorithm is by far the most widely used method for discovering clusters in data. We show how to accelerate it dramatically while still always computing exactly the same result as the standard algorithm. The accelerated algorithm avoids unnecessary distance calculations by applying the triangle inequality in two different ways, and by keeping track of lower and upper bounds for distances between points and centers. Experiments show that the new algorithm is effective for datasets with up to 1000 dimensions, and that it becomes increasingly effective as the number k of clusters increases. For k ≥ 20 it is many times faster than the best previously known accelerated k-means method.
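To make the pruning idea concrete, here is a minimal sketch of one way the triangle inequality can remove distance calculations in the assignment step. It relies on a standard consequence of the inequality: for a point x and centers b and c, if d(b, c) ≥ 2 d(x, b), then d(x, c) ≥ d(x, b), so c cannot be strictly closer to x than b and d(x, c) need not be computed. The function names (`dist`, `assign_brute_force`, `assign_with_pruning`) are hypothetical and chosen for illustration. The sketch covers only this single test within one assignment pass; it does not maintain the lower and upper bounds across iterations that the abstract mentions, so it should not be read as the full accelerated algorithm.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_brute_force(points, centers):
    """Standard assignment: computes every point-to-center distance."""
    labels = []
    for x in points:
        best, best_d = 0, dist(x, centers[0])
        for j in range(1, len(centers)):
            d = dist(x, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels


def assign_with_pruning(points, centers):
    """Assignment that skips centers ruled out by the triangle inequality.

    Returns (labels, number of point-to-center distances computed).
    Illustrative sketch only: no bounds are carried between iterations.
    """
    k = len(centers)
    # Center-to-center distances, computed once for the whole pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, computed = [], 0
    for x in points:
        best, best_d = 0, dist(x, centers[0])
        computed += 1
        for j in range(1, k):
            # d(best, j) >= 2 * d(x, best) implies d(x, j) >= d(x, best),
            # so center j cannot be strictly closer; skip its distance.
            if cc[best][j] >= 2 * best_d:
                continue
            d = dist(x, centers[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed
```

Because a center is skipped only when it provably cannot be strictly closer than the current best, the labels match those of the brute-force pass, which mirrors the abstract's claim that the result is exactly the same as the standard algorithm's. The saving grows when centers are far apart relative to the distance from each point to its nearest center.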
