Scalable k-Means Clustering via Lightweight Coresets

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of lightweight coresets that allows for both multiplicative and additive errors. We provide a single algorithm to construct lightweight coresets for k-means clustering as well as for soft and hard Bregman clustering. The algorithm is substantially faster than existing constructions, embarrassingly parallel, and the resulting coresets are smaller. We further show that the proposed approach naturally generalizes to statistical k-means clustering and that, compared to existing results, it can be used to compute smaller summaries for empirical risk minimization. In extensive experiments, we demonstrate that the proposed algorithm outperforms existing data summarization strategies in practice.
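To make the construction concrete, below is a minimal Python sketch of a lightweight-coreset construction for k-means, assuming the two-pass importance-sampling scheme the paper describes: each point is sampled with probability given by an equal mixture of the uniform distribution and the distribution proportional to its squared distance from the data mean, and carries an inverse-probability weight so that weighted sums over the sample approximate sums over the full data set. The function name `lightweight_coreset` and its interface are illustrative, not the authors' reference implementation.

```python
import numpy as np

def lightweight_coreset(X, m, seed=None):
    """Draw a weighted sample (C, w) of m rows from X.

    Sampling mixes the uniform distribution with the distribution
    proportional to squared distance from the data mean; weights are
    inverse sampling probabilities scaled by 1/m.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mu = X.mean(axis=0)                      # first pass: data mean
    dist_sq = ((X - mu) ** 2).sum(axis=1)    # second pass: squared distances
    total = dist_sq.sum()
    if total == 0.0:                         # degenerate data: all points equal
        q = np.full(n, 1.0 / n)
    else:
        q = 0.5 / n + 0.5 * dist_sq / total  # mixture sampling distribution
    q /= q.sum()                             # guard against floating-point drift
    idx = rng.choice(n, size=m, replace=True, p=q)
    return X[idx], 1.0 / (m * q[idx])
```

Note that this needs only two passes over the data and no pairwise distances, which is consistent with the speed and parallelism claims above: the mean and the normalizer are plain sums, so both passes distribute trivially across data shards. The resulting weighted sample can then be handed to any k-means solver that accepts sample weights, e.g. scikit-learn's `KMeans().fit(C, sample_weight=w)`.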
