A Simple D2-Sampling Based PTAS for k-Means and Other Clustering Problems

Abstract: Given a set of points $P \subset \mathbb{R}^{d}$, the k-means clustering problem is to find a set of $k$ centers $C = \{c_{1}, \ldots, c_{k}\}$, $c_{i} \in \mathbb{R}^{d}$, such that the objective function $\sum_{x \in P} e(x, C)^{2}$, where $e(x, C)$ denotes the Euclidean distance between $x$ and the closest center in $C$, is minimized. This is one of the most prominent objective functions studied in clustering. $D^{2}$-sampling (Arthur and Vassilvitskii, SODA '07, pp. 1027–1035) is a simple non-uniform technique for sampling points from a set of points. It works as follows: given a set of points $P \subset \mathbb{R}^{d}$, the first point is chosen uniformly at random from $P$. Each subsequent point is chosen from $P$ with probability proportional to the square of its distance to the nearest previously sampled point. $D^{2}$-sampling has been shown to have nice properties with respect to the k-means clustering problem. Arthur and Vassilvitskii (SODA '07) show that $k$ points chosen as centers from $P$ using $D^{2}$-sampling give an $O(\log k)$ approximation in expectation. Ailon et al. (NIPS 2009) and Aggarwal et al. (APPROX 2009) extended this result to show that $O(k)$ points chosen as centers using $D^{2}$-sampling give an $O(1)$ approximation to the k-means objective function with high probability. In this paper, we further demonstrate the power of $D^{2}$-sampling by giving a simple randomized $(1+\epsilon)$-approximation algorithm that uses $D^{2}$-sampling at its core.
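To make the sampling procedure concrete, here is a minimal Python sketch of $D^{2}$-sampling and of the k-means objective it targets. The names (d2_sample, kmeans_cost) are our own illustration, not the paper's notation; the paper's $(1+\epsilon)$-approximation algorithm builds further machinery on top of this primitive.

```python
import random

def squared_dist(a, b):
    # Squared Euclidean distance between two points given as coordinate tuples.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_cost(points, centers):
    # The k-means objective: sum over x in P of e(x, C)^2,
    # where e(x, C) is the distance from x to the closest center in C.
    return sum(min(squared_dist(p, c) for c in centers) for p in points)

def d2_sample(points, k):
    # First center: chosen uniformly at random from P.
    centers = [random.choice(points)]
    # cost[i] = squared distance from points[i] to its nearest chosen center.
    cost = [squared_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # Next center: index i is drawn with probability cost[i] / sum(cost),
        # i.e. proportional to the squared distance to the nearest sample so far.
        i = random.choices(range(len(points)), weights=cost, k=1)[0]
        centers.append(points[i])
        # Update each point's cost against the newly added center.
        cost = [min(c, squared_dist(p, points[i])) for p, c in zip(points, cost)]
    return centers

# Example: pick 5 centers from 200 random points in the plane.
pts = [(random.random(), random.random()) for _ in range(200)]
print(kmeans_cost(pts, d2_sample(pts, 5)))
```

By the guarantee of Arthur and Vassilvitskii, the centers returned by d2_sample(P, k) achieve, in expectation, a k-means cost within an $O(\log k)$ factor of optimal.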

[1] Sergei Vassilvitskii, et al. How slow is the k-means method?, 2006, SCG '06.

[2] Johannes Blömer, et al. Bregman Clustering for Separable Instances, 2010, SWAT.

[3] K. Wakimoto, et al. Efficient and Effective Querying by Image Content, 1994.

[4] Avrim Blum, et al. Stability Yields a PTAS for k-Median and k-Means Clustering, 2010, FOCS '10.

[5] Nir Ailon, et al. Streaming k-means approximation, 2009, NIPS.

[6] Sergei Vassilvitskii, et al. k-means++: the advantages of careful seeding, 2007, SODA '07.

[7] Marek Karpinski, et al. Approximation schemes for clustering problems, 2003, STOC '03.

[8] Piotr Indyk, et al. Approximate clustering via core-sets, 2002, STOC '02.

[9] Amit Kumar, et al. Linear-time approximation schemes for clustering problems in any dimensions, 2010, JACM.

[10] S. Dasgupta. The hardness of k-means clustering, 2008.

[11] Antony F. R. Brown. Language Translation, 1958, JACM.

[12] Dan Feldman, et al. Data reduction for weighted and outlier-resistant clustering, 2012, SODA.

[13] Bodo Manthey, et al. Smoothed Analysis of the k-Means Method, 2011, JACM.

[14] J. Matoušek. On Approximate Geometric k-Clustering, 1999.

[15] Sariel Har-Peled, et al. On coresets for k-means and k-median clustering, 2004, STOC '04.

[16] Amit Kumar, et al. A Simple D2-Sampling Based PTAS for k-Means and other Clustering Problems, 2012, COCOON.

[17] Dan Feldman, et al. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering, 2013, SODA.

[18] Michael J. Swain, et al. Color indexing, 1991, International Journal of Computer Vision.

[19] Michael Langberg, et al. A unified framework for approximating and clustering data, 2011, STOC.

[20] Shimon Even, et al. An On-Line Edge-Deletion Problem, 1981, JACM.

[21] Sariel Har-Peled, et al. How Fast Is the k-Means Method?, 2005, SODA '05.

[22] Ankit Aggarwal, et al. Adaptive Sampling for k-Means Clustering, 2009, APPROX-RANDOM.

[23] Sariel Har-Peled, et al. Smaller Coresets for k-Median and k-Means Clustering, 2005, SCG.

[24] Richard A. Harshman, et al. Indexing by Latent Semantic Analysis, 1990, J. Am. Soc. Inf. Sci.

[25] Andrea Vattani. k-means Requires Exponentially Many Iterations Even in the Plane, 2009, SCG '09.

[26] Geoffrey Zweig, et al. Syntactic Clustering of the Web, 1997, Comput. Networks.

[27] S. P. Lloyd. Least squares quantization in PCM, 1982, IEEE Trans. Inf. Theory.

[28] Johannes Blömer, et al. Coresets and approximate clustering for Bregman divergences, 2009, SODA.

[29] Tricia Walker, et al. Computer science, 1996, English for academic purposes series.

[30] Inderjit S. Dhillon, et al. Clustering with Bregman Divergences, 2005, J. Mach. Learn. Res.

[31] Marcel R. Ackermann. Algorithms for the Bregman k-Median problem, 2009.

[32] Mary Inaba, et al. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract), 1994, SCG '94.

[33] Dan Feldman, et al. A PTAS for k-means clustering based on weak coresets, 2007, SCG '07.

[34] Ke Chen. On k-Median clustering in high dimensions, 2006, SODA '06.

[35] Christian Sohler, et al. Coresets in dynamic geometric data streams, 2005, STOC '05.

[36] Ke Chen. On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications, 2009, SIAM J. Comput.

[37] Marcel R. Ackermann, et al. Clustering for metric and non-metric distance measures, 2008, SODA '08.