Practical Coreset Constructions for Machine Learning

We investigate coresets: succinct, small summaries of large data sets such that solutions found on the summary are provably competitive with solutions found on the full data set. We provide an overview of the state of the art in coreset construction for machine learning. In Section 2, we present both the intuition behind and a theoretically sound framework for constructing coresets for general problems, and apply it to $k$-means clustering. In Section 3, we summarize existing coreset construction algorithms for a variety of machine learning problems, such as maximum likelihood estimation of mixture models, Bayesian non-parametric models, principal component analysis, regression, and general empirical risk minimization.
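
To make the idea of a weighted summary concrete, the following is a minimal sketch of one simple importance-sampling scheme for $k$-means. It is an illustration under stated assumptions, not necessarily the construction developed in Section 2. The function names `importance_sampling_coreset` and `kmeans_cost`, the sampling distribution (an even mix of uniform sampling and squared distance to the data mean), and the parameter `m` are all choices made here for illustration.

```python
import numpy as np


def importance_sampling_coreset(X, m, seed=0):
    """Sample a weighted subset of X whose weighted k-means cost is an
    unbiased estimate of the cost on the full data set.

    Assumption: points are drawn with probability
    q(x) = 1/(2n) + d(x, mean)^2 / (2 * sum_x' d(x', mean)^2),
    a simple stand-in for sensitivity-based sampling.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mean = X.mean(axis=0)
    sq_dist = ((X - mean) ** 2).sum(axis=1)
    total = sq_dist.sum()
    if total > 0:
        q = 0.5 / n + 0.5 * sq_dist / total
    else:
        # All points coincide: fall back to uniform sampling.
        q = np.full(n, 1.0 / n)
    idx = rng.choice(n, size=m, replace=True, p=q)
    # Inverse-probability weights keep the cost estimate unbiased.
    weights = 1.0 / (m * q[idx])
    return X[idx], weights


def kmeans_cost(X, centers, weights=None):
    """(Weighted) sum of squared distances to the nearest center."""
    diffs = X[:, None, :] - centers[None, :, :]
    d2 = (diffs ** 2).sum(axis=2).min(axis=1)
    if weights is None:
        return d2.sum()
    return (weights * d2).sum()
```

For any fixed set of centers, `kmeans_cost(C, centers, w)` on the sampled pair `(C, w)` estimates `kmeans_cost(X, centers)` on the full data. How large `m` must be for a provable guarantee depends on the analysis and is not addressed by this sketch.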
