On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications

We present new approximation algorithms for the $k$-median and $k$-means clustering problems. To this end, we obtain small coresets for $k$-median and $k$-means clustering in general metric spaces and in Euclidean spaces. In $\mathbb{R}^d$, the size of these coresets depends polynomially on the dimension $d$. This leads to $(1+\varepsilon)$-approximation algorithms for optimal $k$-median and $k$-means clustering in $\mathbb{R}^d$, with running time $O(ndk+2^{(k/\varepsilon)^{O(1)}}d^2\log^{k+2}n)$, where $n$ is the number of points. This improves over previous results. We use these coresets to maintain a $(1+\varepsilon)$-approximate $k$-median and $k$-means clustering of a stream of points in $\mathbb{R}^d$, using $O(d^2k^2\varepsilon^{-2}\log^8n)$ space. These are the first streaming algorithms for these problems whose space complexity depends polynomially on the dimension.
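
The abstract does not spell out what a coreset guarantees, so the following is only an illustrative sketch of the usual definition, not the construction from the paper: a weighted point set $S$ is a coreset for $P$ if, for every choice of $k$ centers $C$, the weighted cost of $S$ is within a factor $(1\pm\varepsilon)$ of the cost of $P$. The function names `clustering_cost` and `coreset_relative_error` and the toy data are assumptions made for illustration.

```python
import math

def clustering_cost(points, weights, centers, power=1):
    """Weighted clustering cost of `points` with respect to `centers`.

    power=1 gives the k-median cost (sum of distances),
    power=2 gives the k-means cost (sum of squared distances).
    """
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(math.dist(p, c) for c in centers)
        total += w * nearest ** power
    return total

def coreset_relative_error(points, coreset, coreset_weights, centers, power=1):
    """Relative gap between the full cost and the weighted coreset cost.

    Hypothetical helper: it checks one fixed set of centers only, whereas
    the coreset property must hold for every set of k centers.
    """
    full = clustering_cost(points, [1.0] * len(points), centers, power)
    approx = clustering_cost(coreset, coreset_weights, centers, power)
    return abs(approx - full) / full if full > 0 else abs(approx)

# Toy data (assumed): two coincident points collapse into one point of weight 2.
points = [(0.0, 0.0), (0.0, 0.0), (2.0, 0.0)]
coreset = [(0.0, 0.0), (2.0, 0.0)]
coreset_weights = [2.0, 1.0]
centers = [(1.0, 0.0)]

median_error = coreset_relative_error(points, coreset, coreset_weights, centers, power=1)
means_error = coreset_relative_error(points, coreset, coreset_weights, centers, power=2)
```

In this toy case the weighted set reproduces the cost exactly because it only merges duplicate points; a real coreset is much smaller than the input and matches the cost only up to the $(1\pm\varepsilon)$ factor.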
