Epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics

We study the problem of constructing ε-coresets for the (k, z)-clustering problem in a doubling metric M(X, d). An ε-coreset is a weighted subset S ⊆ X with weight function w: S → R_≥0, such that for any k-subset C ∈ [X]^k, it holds that Σ_{x∈S} w(x) ⋅ d^z(x, C) ∈ (1 ± ε) ⋅ Σ_{x∈X} d^z(x, C). We present an efficient algorithm that constructs an ε-coreset for the (k, z)-clustering problem in M(X, d), where the size of the coreset depends only on the parameters k, z, ε and the doubling dimension ddim(M). To the best of our knowledge, this is the first efficient ε-coreset construction of size independent of |X| for general clustering problems in doubling metrics. To this end, we establish the first relation between the doubling dimension of M(X, d) and the shattering dimension (or VC-dimension) of the range space induced by the distance d. Such a relation was not known before, since one can easily construct instances in which neither quantity can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small (1 ± ε)-distortion of the distance function d and consider the notion of τ-error probabilistic shattering dimension, we can prove an upper bound of O(ddim(M) ⋅ log(1/ε) + log log(1/τ)) on the probabilistic shattering dimension, even for weighted doubling metrics. We believe this new relation is of independent interest and may find other applications. We also study robust coresets and centroid sets in doubling metrics. Our robust coreset construction leads to new results in clustering and property testing, and the centroid sets can be used to accelerate local search algorithms for clustering problems, an acceleration previously known only for Euclidean spaces. In particular, we apply our centroid set to accelerate the local search algorithm (studied in [Friggstad et al., FOCS 2016]) for the (k, z)-clustering problem in doubling metrics.
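
To make the coreset condition above concrete, the following is a minimal sketch (not the paper's construction) that brute-force checks the (1 ± ε) definition for every k-subset of candidate centers, using Euclidean points in the plane as an example doubling metric. The function names clustering_cost and is_eps_coreset, and the exhaustive check over all k-subsets, are illustrative assumptions introduced here for clarity.

```python
# Brute-force checker for the epsilon-coreset condition of (k, z)-clustering.
# Illustrative sketch only; a real construction samples S far more cleverly
# (e.g., via sensitivity sampling) and never enumerates all k-subsets.
import itertools
import math


def clustering_cost(points, weights, centers, z):
    # Weighted (k, z)-clustering cost: sum_x w(x) * min_{c in C} d(x, c)^z.
    return sum(w * min(math.dist(x, c) for c in centers) ** z
               for x, w in zip(points, weights))


def is_eps_coreset(X, S, w, k, z, eps):
    # For every k-subset C of X, the weighted cost on S must lie within
    # (1 +/- eps) of the full (unweighted) cost on X.
    unit = [1.0] * len(X)
    for C in itertools.combinations(X, k):
        full = clustering_cost(X, unit, C, z)
        core = clustering_cost(S, w, C, z)
        if not ((1 - eps) * full <= core <= (1 + eps) * full):
            return False
    return True


if __name__ == "__main__":
    X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
    # Sanity check: X itself with unit weights is trivially a 0-coreset,
    # so it passes for any eps.
    print(is_eps_coreset(X, X, [1.0] * len(X), k=2, z=2, eps=0.1))  # True
```

The check costs Θ(|X|^k) evaluations and is only meant to spell out the definition; the point of the paper is that a valid S of size depending only on k, z, ε and ddim(M) exists and can be computed efficiently.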

[1]  Lee-Ad Gottlieb,et al.  Improved algorithms for fully dynamic geometric spanners and geometric routing , 2008, SODA '08.

[2]  Ke Chen,et al.  On k-Median clustering in high dimensions , 2006, SODA '06.

[3]  Lee-Ad Gottlieb,et al.  An Optimal Dynamic Spanner for Doubling Metric Spaces , 2008, ESA.

[4]  Robert Krauthgamer,et al.  Bounded geometries, fractals, and low-distortion embeddings , 2003, 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings..

[5]  Sariel Har-Peled,et al.  Smaller Coresets for k-Median and k-Means Clustering , 2005, SCG.

[6]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[7]  Aravind Srinivasan,et al.  Randomized Distributed Edge Coloring via an Extension of the Chernoff-Hoeffding Bounds , 1997, SIAM J. Comput..

[8]  T.-H. Hubert Chan,et al.  Reducing Curse of Dimensionality , 2016, SODA.

[9]  Ittai Abraham,et al.  Advances in metric embedding theory , 2006, STOC '06.

[10]  Mohammad R. Salavatipour,et al.  Local Search Yields a PTAS for k-Means in Doubling Metrics , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[11]  Li Ning,et al.  New Doubling Spanners: Better and Simpler , 2013, SIAM J. Comput..

[12]  Li Ning,et al.  Sparse Fault-Tolerant Spanners for Doubling Metrics with Bounded Hop-Diameter or Degree , 2013, Algorithmica.

[13]  Umesh V. Vazirani,et al.  An Introduction to Computational Learning Theory , 1994 .

[14]  Yi Li,et al.  Improved bounds on the sample complexity of learning , 2000, SODA '00.

[15]  Jeff M. Phillips,et al.  Coresets and Sketches , 2016, ArXiv.

[16]  Andrew Y. Ng,et al.  Learning Feature Representations with K-Means , 2012, Neural Networks: Tricks of the Trade.

[17]  Noga Alon,et al.  Testing of Clustering , 2003, SIAM J. Discret. Math..

[18]  Samir Khuller,et al.  Algorithms for facility location problems with outliers , 2001, SODA '01.

[19]  P. Assouad Plongements lipschitziens dans R^n , 2003 .

[20]  Kunal Talwar,et al.  Bypassing the embedding: algorithms for low dimensional metrics , 2004, STOC '04.

[21]  Richard Cole,et al.  Searching dynamic point sets in spaces with bounded doubling dimension , 2006, STOC '06.

[22]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[23]  Dan Feldman,et al.  Data reduction for weighted and outlier-resistant clustering , 2012, SODA.

[24]  J. Matoušek  On Approximate Geometric k-Clustering , 1999 .

[25]  Piotr Indyk,et al.  Nearest-neighbor-preserving embeddings , 2007, TALG.

[26]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.

[27]  Rajiv Gandhi,et al.  Dependent rounding and its applications to approximation algorithms , 2006, JACM.

[28]  Sariel Har-Peled Clustering Motion , 2004, Discret. Comput. Geom..

[29]  Lee-Ad Gottlieb,et al.  Efficient Classification for Metric Data , 2014, IEEE Trans. Inf. Theory.

[30]  Kenneth L. Clarkson,et al.  Nearest Neighbor Queries in Metric Spaces , 1997, STOC '97.

[31]  Anupam Gupta,et al.  Ultra-low-dimensional embeddings for doubling metrics , 2008, SODA '08.

[32]  Vladimir Braverman,et al.  Clustering High Dimensional Dynamic Data Streams , 2017, ICML.

[33]  T.-H. Hubert Chan,et al.  A PTAS for the Steiner Forest Problem in Doubling Metrics , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[34]  Andreas Krause,et al.  Scalable and Distributed Clustering via Lightweight Coresets , 2017, ArXiv.

[35]  Shay Solomon From hierarchical partitions to hierarchical covers: optimal fault-tolerant spanners for doubling metrics , 2014, STOC.

[36]  Norbert Sauer,et al.  On the Density of Families of Sets , 1972, J. Comb. Theory A.

[37]  Andreas Krause,et al.  Training Mixture Models at Scale via Coresets , 2017 .

[38]  Yi Li,et al.  Using the doubling dimension to analyze the generalization of learning algorithms , 2009, J. Comput. Syst. Sci..

[39]  Pankaj K. Agarwal,et al.  Approximating extent measures of points , 2004, JACM.

[40]  Khaled M. Elbassioni,et al.  A QPTAS for TSP with fat weakly disjoint neighborhoods in doubling metrics , 2010, SODA '10.

[41]  L. Schulman,et al.  Universal ε-approximators for integrals , 2010, SODA '10.

[42]  Anupam Gupta,et al.  Small Hop-diameter Sparse Spanners for Doubling Metrics , 2006, SODA '06.

[43]  Sariel Har-Peled,et al.  Fast construction of nets in low dimensional metrics, and their applications , 2004, SCG.

[44]  Maria-Florina Balcan,et al.  Distributed k-means and k-median clustering on general communication topologies , 2013, NIPS.

[45]  Dan Feldman,et al.  Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering , 2013, SODA.

[46]  Srinivasan Parthasarathy,et al.  Dependent rounding and its applications to approximation algorithms , 2006 .

[47]  Andreas Krause,et al.  Training Gaussian Mixture Models at Scale via Coresets , 2017, J. Mach. Learn. Res..

[48]  Michael Langberg,et al.  A unified framework for approximating and clustering data , 2011, STOC.

[49]  Xin Xiao,et al.  On the Sensitivity of Shape Fitting Problems , 2012, FSTTCS.

[50]  Leonidas J. Guibas,et al.  Deformable spanners and applications , 2004, SCG '04.

[51]  Anupam Gupta,et al.  Simpler Analyses of Local Search Algorithms for Facility Location , 2008, ArXiv.

[52]  Vladimir Braverman,et al.  New Frameworks for Offline and Streaming Coreset Constructions , 2016, ArXiv.

[53]  Pankaj K. Agarwal,et al.  Exact and Approximation Algorithms for Clustering , 1997 .

[54]  Lee-Ad Gottlieb,et al.  The traveling salesman problem: low-dimensionality implies a polynomial time approximation scheme , 2011, STOC '12.

[55]  Bruce M. Maggs,et al.  On hierarchical routing in doubling metrics , 2005, SODA '05.