On coresets for k-means and k-median clustering

In this paper, we show the existence of small coresets for the problems of computing k-median and k-means clustering for points in low dimension. In other words, we show that given a set P of n points in R^d, one can compute a weighted set S ⊆ P, of size O(k ε^{-d} log n), such that one can compute the k-median/means clustering on S instead of on P and get a (1+ε)-approximation. As a result, we improve the fastest known algorithms for (1+ε)-approximate k-means and k-median. Our algorithms have linear running time for fixed k and ε. In addition, we can maintain the (1+ε)-approximate k-median or k-means clustering of a stream when points are only inserted, using polylogarithmic space and update time.
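
To make the role of the weighted set S concrete, here is a minimal sketch of evaluating the k-means cost on a weighted subset instead of on the full input. It is an illustration only: the uniform grid used to pick S is a simplifying assumption and not the construction from the paper, it does not achieve the stated size bound, and the names kmeans_cost and grid_summary are hypothetical.

```python
from collections import defaultdict


def kmeans_cost(points, weights, centers):
    """Weighted k-means cost: sum of w * squared distance to the nearest center."""
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        total += w * nearest
    return total


def grid_summary(points, cell):
    """Keep one input point per grid cell, weighted by the cell's point count.

    Assumption: a uniform grid stands in for the paper's construction.
    The representatives are input points, so S is a subset of P.
    """
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell) for x in p)
        cells[key].append(p)
    reps = [members[0] for members in cells.values()]
    weights = [len(members) for members in cells.values()]
    return reps, weights


# Hypothetical input: 10,000 points on a regular lattice in the unit square.
P = [(i / 100, j / 100) for i in range(100) for j in range(100)]
S, w = grid_summary(P, cell=0.1)

# Any candidate set of k centers can be priced on S instead of on P.
centers = [(2.0, 2.0), (-1.0, -1.0)]
cost_full = kmeans_cost(P, [1] * len(P), centers)
cost_small = kmeans_cost(S, w, centers)
rel_error = abs(cost_full - cost_small) / cost_full
print(len(P), len(S), rel_error)
```

The weights preserve the total mass of P, so the cost on S tracks the cost on P for a given set of centers. The paper's contribution is a choice of S whose error is at most a (1+ε) factor while its size stays small; the grid here makes no such guarantee.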
