One-Shot Coresets: The Case of k-Clustering

Scaling clustering algorithms to massive data sets is a challenging task. Recently, several successful approaches based on data summarization methods, such as coresets and sketches, have been proposed. While these techniques provide provably good and small summaries, they are inherently problem-dependent: the practitioner has to commit to a fixed clustering objective before even exploring the data. Can one instead construct small data summaries for a wide range of clustering problems simultaneously? In this work, we answer this question affirmatively by proposing an efficient algorithm that constructs such one-shot summaries for k-clustering problems while retaining strong theoretical guarantees.
