Faster PAC Learning and Smaller Coresets via Smoothed Analysis

PAC learning typically aims to compute a small subset ($\varepsilon$-sample/net) of $n$ items that provably approximates a given loss function for every query (model, classifier, hypothesis) from a given set of queries, up to an additive error $\varepsilon\in(0,1)$. Coresets generalize this idea to support multiplicative error $1\pm\varepsilon$. Inspired by smoothed analysis, we suggest a natural generalization: approximate the \emph{average} (instead of the worst-case) error over the queries, in the hope of obtaining smaller subsets. Since the errors of different queries are dependent, we can no longer apply the Chernoff-Hoeffding inequality to a fixed query and then use the VC-dimension or a union bound. This paper provides deterministic and randomized algorithms for computing such coresets and $\varepsilon$-samples of size independent of $n$, for any finite set of queries and any loss function. Example applications include new and improved coreset constructions for, e.g., streaming vector summarization [ICML'17] and $k$-PCA [NIPS'16]. Experimental results with open-source code are provided.
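The contrast between the two error measures can be illustrated with a minimal numerical sketch. The snippet below is not the paper's algorithm; it uses a synthetic (hypothetical) quadratic loss, random data, and plain uniform sampling as the candidate subset, and simply compares the worst-case (PAC-style) error over a finite query set against the average error over the same queries.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10000, 5, 50           # number of items, dimension, number of queries
P = rng.normal(size=(n, d))      # the n input items
Q = rng.normal(size=(m, d))      # a finite set of m queries

def loss(Q, X):
    # Hypothetical per-item loss: squared inner product of query and item.
    # Returns an (m, |X|) matrix: loss of every item under every query.
    return (Q @ X.T) ** 2

full = loss(Q, P).mean(axis=1)       # exact average loss per query, over all n items
idx = rng.choice(n, size=200)        # uniform sample of 200 items (subset candidate)
approx = loss(Q, P[idx]).mean(axis=1)

errors = np.abs(full - approx)
worst_err = errors.max()             # worst-case error over the queries (PAC-style)
avg_err = errors.mean()              # average error over the queries (this paper's goal)
```

By definition `avg_err <= worst_err`, so a subset only needs to be large enough to control the average, which is the intuition for why average-error subsets can be smaller than worst-case ones.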

[1] Philippe Flajolet, et al. Birthday Paradox, Coupon Collectors, Caching Algorithms and Self-Organizing Search, 1992, Discret. Appl. Math.

[2] Ibrahim Jubran, et al. Introduction to Coresets: Accurate Coresets, 2019, ArXiv.

[3] Jeff M. Phillips, et al. Improved Practical Matrix Sketching with Guarantees, 2014, IEEE Transactions on Knowledge and Data Engineering.

[4] Stanislav Minsker. Geometric median and robust estimation in Banach spaces, 2013, arXiv:1308.1334.

[5] Kenneth L. Clarkson, et al. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm, 2008, SODA '08.

[6] Travis E. Oliphant, et al. Guide to NumPy, 2015.

[7] R. J. Webster, et al. Carathéodory's Theorem, 1972, Canadian Mathematical Bulletin.

[8] Dan Feldman, et al. Coresets for Vector Summarization with Applications to Network Graphs, 2017, ICML.

[9] Xinjia Chen, et al. A New Generalization of Chebyshev Inequality for Random Vectors, 2007, ArXiv.

[10] Andreas Christmann, et al. Support vector machines, 2008, Data Mining and Knowledge Discovery Handbook.

[11] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[12] Tamir Tassa, et al. More Constraints, Smaller Coresets: Constrained Matrix Approximation of Sparse Big Data, 2015, KDD.

[13] Charu C. Aggarwal, et al. Neural Networks and Deep Learning, 2018, Springer International Publishing.

[14] Artem Barger, et al. k-Means for Streaming and Distributed Big Sparse Data, 2015, SDM.

[15] Richard Peng, et al. Uniform Sampling for Matrix Approximation, 2014, ITCS.

[16] Martin Jaggi, et al. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization, 2013, ICML.

[17] Dimitris Papailiopoulos, et al. Provable deterministic leverage score sampling, 2014, KDD.

[18] Moses Charikar, et al. Finding frequent items in data streams, 2002, Theor. Comput. Sci.

[19] Vladimir Vapnik, et al. Principles of Risk Minimization for Learning Theory, 1991, NIPS.

[20] C. Carathéodory. Über den Variabilitätsbereich der Koeffizienten von Potenzreihen, die gegebene Werte nicht annehmen, 1907.

[21] S. Muthukrishnan, et al. Relative-Error CUR Matrix Decompositions, 2007, SIAM J. Matrix Anal. Appl.

[22] Shang-Hua Teng, et al. Smoothed analysis: an attempt to explain the behavior of algorithms in practice, 2009, CACM.

[23] Dan Feldman, et al. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering, 2013, SODA.

[24] Sariel Har-Peled, et al. On coresets for k-means and k-median clustering, 2004, STOC '04.

[25] Jeff M. Phillips, et al. Near-Optimal Coresets of Kernel Density Estimates, 2018, Discrete & Computational Geometry.

[26] Sariel Har-Peled. Geometric Approximation Algorithms, 2011.

[27] Michael Langberg, et al. A unified framework for approximating and clustering data, 2011, STOC.

[28] David P. Woodruff, et al. Fast approximation of matrix coherence and statistical leverage, 2011, ICML.

[29] Xin Xiao, et al. On the Sensitivity of Shape Fitting Problems, 2012, FSTTCS.

[30] L. Schulman, et al. Universal ε-approximators for integrals, 2010, SODA '10.

[31] Dan Feldman, et al. Dimensionality Reduction of Massive Sparse Datasets Using Coresets, 2015, NIPS.

[32] Pierre-Olivier Amblard, et al. Determinantal Point Processes for Coresets, 2018, J. Mach. Learn. Res.

[33] Daniel M. Kane, et al. A Derandomized Sparse Johnson-Lindenstrauss Transform, 2010, Electron. Colloquium Comput. Complex.

[34] Jeff M. Phillips, et al. Coresets and Sketches, 2016, ArXiv.

[35] Michael B. Cohen, et al. Dimensionality Reduction for k-Means Clustering and Low Rank Approximation, 2014, STOC.

[36] S. Bergman. The kernel function and conformal mapping, 1950.

[37] Christopher Ré, et al. Weighted SGD for ℓp Regression with Randomized Preconditioning, 2016, SODA.

[38] Michael B. Cohen, et al. Input Sparsity Time Low-rank Approximation via Ridge Leverage Score Sampling, 2015, SODA.

[39] Leslie G. Valiant, et al. A theory of the learnable, 1984, STOC '84.

[40] Arthur E. Hoerl, et al. Ridge Regression: Biased Estimation for Nonorthogonal Problems, 2000, Technometrics.

[41] François Kawala, et al. Prédictions d'activité dans les réseaux sociaux en ligne [Activity prediction in online social networks], 2013.

[42] A. Juditsky, et al. Large Deviations of Vector-valued Martingales in 2-Smooth Normed Spaces, 2008, arXiv:0809.0813.

[43] Alan M. Frieze, et al. Fast Monte-Carlo algorithms for finding low-rank approximations, 2004, JACM.

[44] D. Ruppert. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2004.

[45] Jirí Matousek, et al. Approximations and optimal geometric divide-and-conquer, 1991, STOC '91.

[46] Dimitris Achlioptas, et al. Database-friendly random projections: Johnson-Lindenstrauss with binary coins, 2003, J. Comput. Syst. Sci.

[47] Vladimir Braverman, et al. New Frameworks for Offline and Streaming Coreset Constructions, 2016, ArXiv.

[48] R. Tibshirani. Regression Shrinkage and Selection via the Lasso, 1996.

[49] Daniel B. Work, et al. Using coarse GPS data to quantify city-scale transportation system resilience to extreme events, 2015, ArXiv.

[50] Fred L. Drake, et al. Python 3 Reference Manual, 2009.

[51] Jon Louis Bentley, et al. Decomposable Searching Problems I: Static-to-Dynamic Transformation, 1980, J. Algorithms.

[52] David P. Woodruff, et al. Optimal Approximate Matrix Product in Terms of Stable Rank, 2015, ICALP.

[53] Joel A. Tropp, et al. An Introduction to Matrix Concentration Inequalities, 2015, Found. Trends Mach. Learn.

[54] Andreas Krause, et al. Scalable k-Means Clustering via Lightweight Coresets, 2017, KDD.

[55] Richard Peng, et al. Lp Row Sampling by Lewis Weights, 2015, STOC.

[56] Alaa Maalouf, et al. Tight Sensitivity Bounds For Smaller Coresets, 2019, KDD.