Coresets for $k$-Means and $k$-Median Clustering and their Applications

$\renewcommand{\Re}{{\rm I\!\hspace{-0.025em} R}} \newcommand{\eps}{{\varepsilon}} \newcommand{\Coreset}{{\mathcal{S}}} $ In this paper, we show the existence of small coresets for the problems of computing $k$-median and $k$-means clustering for points in low dimension. In other words, we show that given a point set $P$ in $\Re^d$, one can compute a weighted set $\Coreset \subseteq P$, of size $O(k \eps^{-d} \log{n})$, such that one can compute the $k$-median/means clustering on $\Coreset$ instead of on $P$, and get an $(1+\eps)$-approximation. As a result, we improve the fastest known algorithms for $(1+\eps)$-approximate $k$-means and $k$-median clustering. Our algorithms have linear running time for a fixed $k$ and $\eps$. In addition, we can maintain the $(1+\eps)$-approximate $k$-median or $k$-means clustering of a stream when points are being only inserted, using polylogarithmic space and update time.

[1]  David M. Mount,et al.  A local search approximation algorithm for k-means clustering , 2002, SCG '02.

[2]  Sudipto Guha,et al.  Improved combinatorial algorithms for the facility location and k-median problems , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[3]  Satish Rao,et al.  Approximation schemes for Euclidean k-medians and related problems , 1998, STOC '98.

[4]  Sudipto Guha,et al.  A constant-factor approximation algorithm for the k-median problem (extended abstract) , 1999, STOC '99.

[5]  Vijay V. Vazirani,et al.  Primal-dual approximation algorithms for metric facility location and k-median problems , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[6]  Rina Panigrahy,et al.  Better streaming algorithms for clustering problems , 2003, STOC '03.

[7]  C. Greg Plaxton,et al.  Optimal Time Bounds for Approximate Clustering , 2002, Machine Learning.

[8]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[9]  Jon Louis Bentley,et al.  Decomposable Searching Problems I: Static-to-Dynamic Transformation , 1980, J. Algorithms.

[10]  Sariel Har-Peled A replacement for Voronoi diagrams of near linear size , 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[11]  Sariel Har-Peled,et al.  Projective clustering in high dimensions using core-sets , 2002, SCG '02.

[12]  Kamesh Munagala,et al.  Local Search Heuristics for k-Median and Facility Location Problems , 2004, SIAM J. Comput..

[13]  Kamesh Munagala,et al.  Local search heuristic for k-median and facility location problems , 2001, STOC '01.

[14]  M. Inaba Application of weighted Voronoi diagrams and randomization to variance-based k-clustering , 1994, SoCG 1994.

[15]  Pankaj K. Agarwal,et al.  Approximation Algorithms for k-Line Center , 2002, ESA.

[16]  Piotr Indyk,et al.  Approximate clustering via core-sets , 2002, STOC '02.

[17]  Pankaj K. Agarwal,et al.  Approximating extent measures of points , 2004, JACM.

[18]  Piotr Indyk,et al.  Sublinear time algorithms for metric space problems , 1999, STOC '99.

[19]  Sudipto Guha,et al.  Clustering Data Streams , 2000, FOCS.

[20]  Sariel Har-Peled Clustering Motion , 2004, Discret. Comput. Geom..

[21]  Tomás Feder,et al.  Optimal algorithms for approximate clustering , 1988, STOC '88.

[22]  Teofilo F. GONZALEZ,et al.  Clustering to Minimize the Maximum Intercluster Distance , 1985, Theor. Comput. Sci..

[23]  Marek Karpinski,et al.  Approximation schemes for clustering problems , 2003, STOC '03.

[24]  Sanjeev Arora,et al.  Polynomial time approximation schemes for Euclidean TSP and other geometric problems , 1996, Proceedings of 37th Conference on Foundations of Computer Science.

[25]  Michiel H. M. Smid,et al.  Simple Randomized Algorithms for Closest Pair Problems , 1995, Nord. J. Comput..

[26]  J. Matou On Approximate Geometric K-clustering , 1999 .