A new coreset framework for clustering

Given a metric space, the (k,z)-clustering problem consists of finding k centers such that the sum of the distances, each raised to the power z, from every point to its closest center is minimized. This encapsulates the famous k-median (z=1) and k-means (z=2) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of the solutions, also known as coresets, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean spaces and doubling metrics to minor-free metrics and general metrics.
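
To make the objective concrete, the following is a minimal sketch of the (k,z)-clustering cost, assuming points in Euclidean space and optional per-point weights. The function name clustering_cost and the toy data are illustrative assumptions, not part of the paper, and the weighted set shown is only a trivial exact example (merged duplicates), not the coreset construction the paper proposes.

```python
import numpy as np

def clustering_cost(points, centers, z, weights=None):
    """Sum over points of weight * (distance to nearest center) ** z.

    Assumes the Euclidean metric; z=1 gives k-median, z=2 gives k-means.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Pairwise distances between every point and every center, shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    if weights is None:
        weights = np.ones(len(points))
    return float(np.sum(np.asarray(weights, dtype=float) * nearest ** z))

# Toy data (hypothetical): the point (0, 0) appears twice.
data = [[0, 0], [0, 0], [0, 4], [10, 0]]
centers = [[0, 2], [10, 0]]

# A weighted summary that merges the duplicate into one point of weight 2.
summary = [[0, 0], [0, 4], [10, 0]]
summary_weights = [2, 1, 1]

k_median_cost = clustering_cost(data, centers, z=1)
k_means_cost = clustering_cost(data, centers, z=2)
summary_cost = clustering_cost(summary, centers, z=2, weights=summary_weights)
```

Here the weighted summary reproduces the cost of the full data exactly for any choice of centers, because it only merges identical points. A coreset in the sense of the abstract relaxes this to approximate agreement while being much smaller than the input.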
