The Power of Uniform Sampling for Coresets

Motivated by practical generalizations of the classic k-median and k-means objectives, such as clustering with size constraints, fair clustering, and Wasserstein barycenter, we introduce a meta-theorem for designing coresets for constrained-clustering problems. The meta-theorem reduces the task of coreset construction to one on a bounded number of ring instances with a much-relaxed additive error. This reduction enables us to construct coresets using uniform sampling, in contrast to the widely used importance sampling, and consequently we can easily handle constrained objectives. Notably and perhaps surprisingly, this simpler sampling scheme can yield coresets whose size is independent of n, the number of input points. Our technique yields smaller coresets, and sometimes the first coresets, for a large number of constrained clustering problems, including capacitated clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in minor-excluded graphs, and polygon clustering under Fréchet and Hausdorff distances. Finally, our technique also yields smaller coresets for 1-median in low-dimensional Euclidean spaces, specifically of size $\tilde{O}(\varepsilon^{-1.5})$ in $\mathbb{R}^{2}$ and $\tilde{O}(\varepsilon^{-1.6})$ in $\mathbb{R}^{3}$.
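To give a rough feel for what uniform sampling on ring instances can look like, the following minimal Python sketch groups points into rings by their distance to a fixed center, then draws a uniform sample from each ring and reweights it so the total weight is preserved. This is an illustration under simplifying assumptions, not the construction from the paper: the function names `ring_uniform_coreset` and `cost`, the power-of-two ring boundaries, the single fixed center, and the per-ring sample size `m` are all hypothetical choices made for the example.

```python
import math
import random


def ring_uniform_coreset(points, center, m, seed=0):
    """Illustrative only: uniform sampling within distance rings around `center`.

    Points at distance in [2^j, 2^(j+1)) from `center` share ring j
    (an assumed decomposition). Each ring contributes at most m sampled
    points, each weighted by |ring| / (number sampled).
    """
    rng = random.Random(seed)
    rings = {}
    for p in points:
        d = math.dist(p, center)
        # Points that coincide with the center get their own bucket.
        key = None if d == 0 else math.floor(math.log2(d))
        rings.setdefault(key, []).append(p)

    coreset = []
    for ring in rings.values():
        size = min(m, len(ring))
        weight = len(ring) / size  # keeps the ring's total weight unchanged
        for p in rng.sample(ring, size):
            coreset.append((p, weight))
    return coreset


def cost(weighted_points, center):
    """Weighted 1-median cost: sum of weight * distance to `center`."""
    return sum(w * math.dist(p, center) for p, w in weighted_points)
```

Because all points in a ring lie at comparable distances from the center, a uniform sample within a ring can stand in for the ring as a whole; the sketch makes no claim about the error guarantees or sample sizes established in the paper.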
