On k-Median clustering in high dimensions

We study approximation algorithms for <i>k</i>-median clustering. We obtain small coresets for <i>k</i>-median clustering in metric spaces as well as in Euclidean spaces. Specifically, in R<sup>d</sup>, the size of those coresets has only <i>polynomial</i> dependency on <i>d</i>. This leads to a (1 + ε)-approximation algorithm for <i>k</i>-median clustering in R<sup>d</sup>, with running time <i>O</i>(<i>ndk</i> + 2<sup>(<i>k</i>/ε)<sup><i>O</i>(1)</sup></sup><i>d</i><sup>2</sup><i>n</i><sup>σ</sup>), for any σ > 0. This is an improvement over previous results [5, 20, 21]. We also provide fast constant factor approximation algorithms for <i>k</i>-median clustering in finite metric spaces. We use those coresets to compute a (1 + ε)-approximation to <i>k</i>-median clustering in the streaming model of computation, using only <i>O</i>(<i>k</i><sup>2</sup><i>d</i>ε<sup>-2</sup> log<sup>8</sup> <i>n</i>) space, where the points are taken from R<sup>d</sup>. This is the first streaming algorithm for this problem whose space complexity has only <i>polynomial</i> dependency on the dimension.

[1]  Teofilo F. Gonzalez,et al.  Clustering to Minimize the Maximum Intercluster Distance , 1985, Theor. Comput. Sci.

[2]  Kasturi R. Varadarajan,et al.  Geometric Approximation via Coresets , 2007 .

[3]  Leonard Pitt,et al.  Sublinear time approximate clustering , 2001, SODA '01.

[4]  Pankaj K. Agarwal,et al.  Approximating extent measures of points , 2004, JACM.

[5]  Satish Rao,et al.  Approximation schemes for Euclidean k-medians and related problems , 1998, STOC '98.

[6]  Amit Kumar,et al.  Linear Time Algorithms for Clustering Problems in Any Dimensions , 2005, ICALP.

[7]  Vijay V. Vazirani,et al.  Primal-dual approximation algorithms for metric facility location and k-median problems , 1999, 40th Annual Symposium on Foundations of Computer Science.

[8]  Sariel Har-Peled,et al.  Smaller Coresets for k-Median and k-Means Clustering , 2005, SCG.

[9]  Piotr Indyk,et al.  Algorithms for dynamic geometric problems over data streams , 2004, STOC '04.

[10]  Rina Panigrahy,et al.  Better streaming algorithms for clustering problems , 2003, STOC '03.

[11]  Amit Kumar,et al.  A simple linear time (1 + ε)-approximation algorithm for geometric k-means clustering in any dimensions , 2004 .

[12]  David Haussler,et al.  Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications , 1992, Inf. Comput.

[13]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.

[14]  Jon Louis Bentley,et al.  Decomposable Searching Problems I: Static-to-Dynamic Transformation , 1980, J. Algorithms.

[15]  Sudipto Guha,et al.  A constant-factor approximation algorithm for the k-median problem (extended abstract) , 1999, STOC '99.

[17]  Piotr Indyk,et al.  Approximate clustering via core-sets , 2002, STOC '02.

[18]  Amit Kumar,et al.  A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[19]  Kamesh Munagala,et al.  Local search heuristic for k-median and facility location problems , 2001, STOC '01.

[20]  Kenneth L. Clarkson,et al.  Optimal core-sets for balls , 2008, Comput. Geom.

[21]  Piotr Indyk,et al.  Sublinear time algorithms for metric space problems , 1999, STOC '99.

[22]  Sudipto Guha,et al.  Clustering Data Streams , 2000, FOCS.

[23]  Satish Rao,et al.  A Nearly Linear-Time Approximation Scheme for the Euclidean k-median Problem , 1999, ESA.

[24]  Sariel Har-Peled,et al.  Coresets for k-Means and k-Median Clustering and their Applications , 2018, STOC 2004.

[25]  C. Greg Plaxton,et al.  Optimal Time Bounds for Approximate Clustering , 2002, Machine Learning.

[26]  Christian Sohler,et al.  Coresets in dynamic geometric data streams , 2005, STOC '05.