High probability generalization bounds for uniformly stable algorithms with nearly optimal rate

Algorithmic stability is a classical approach to understanding and analysis of the generalization error of learning algorithms. A notable weakness of most stability-based generalization bounds is that they hold only in expectation. Generalization with high probability has been established in a landmark paper of Bousquet and Elisseeff (2002) albeit at the expense of an additional $\sqrt{n}$ factor in the bound. Specifically, their bound on the estimation error of any $\gamma$-uniformly stable learning algorithm on $n$ samples and range in $[0,1]$ is $O(\gamma \sqrt{n \log(1/\delta)} + \sqrt{\log(1/\delta)/n})$ with probability $\geq 1-\delta$. The $\sqrt{n}$ overhead makes the bound vacuous in the common settings where $\gamma \geq 1/\sqrt{n}$. A stronger bound was recently proved by the authors (Feldman and Vondrak, 2018) that reduces the overhead to at most $O(n^{1/4})$. Still, both of these results give optimal generalization bounds only when $\gamma = O(1/n)$. We prove a nearly tight bound of $O(\gamma \log(n)\log(n/\delta) + \sqrt{\log(1/\delta)/n})$ on the estimation error of any $\gamma$-uniformly stable algorithm. It implies that for algorithms that are uniformly stable with $\gamma = O(1/\sqrt{n})$, estimation error is essentially the same as the sampling error. Our result leads to the first high-probability generalization bounds for multi-pass stochastic gradient descent and regularized ERM for stochastic convex problems with nearly optimal rate --- resolving open problems in prior work. Our proof technique is new and we introduce several analysis tools that might find additional applications.
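
To make the rates concrete, here is a brief back-of-the-envelope comparison (illustrative only, not part of the paper's statement) in the representative regime $\gamma = 1/\sqrt{n}$ that the abstract highlights:
\begin{align*}
\text{Bousquet--Elisseeff (2002):}\quad & \gamma\sqrt{n\log(1/\delta)} + \sqrt{\log(1/\delta)/n} \;=\; \sqrt{\log(1/\delta)} + \sqrt{\log(1/\delta)/n} \;=\; \Omega(1),\\
\text{Feldman--Vondrak (2018):}\quad & \text{overhead of } O(n^{1/4}) \text{ over } \gamma \;\Longrightarrow\; \text{roughly } n^{-1/4},\\
\text{this work:}\quad & \gamma\log(n)\log(n/\delta) + \sqrt{\log(1/\delta)/n} \;=\; \tilde{O}\big(1/\sqrt{n}\big).
\end{align*}
Only the last expression matches the $\Theta(1/\sqrt{n})$ sampling error up to logarithmic factors, which is the sense in which the new rate is nearly optimal.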

[1] W. Rogers, et al. A Finite Sample Distribution-Free Performance Bound for Local Discrimination Rules, 1978.

[2] Luc Devroye, et al. Distribution-free performance bounds with the resubstitution error estimate (Corresp.), 1979, IEEE Trans. Inf. Theory.

[3] Luc Devroye, et al. Distribution-free inequalities for the deleted and holdout error estimates, 1979, IEEE Trans. Inf. Theory.

[4] Gábor Lugosi, et al. On the posterior-probability estimate of the error rate of nonparametric classification rules, 1993, IEEE Trans. Inf. Theory.

[5] László Györfi, et al. A Probabilistic Theory of Pattern Recognition, 1996, Stochastic Modelling and Applied Probability.

[6] Dana Ron, et al. Algorithmic Stability and Sanity-Check Bounds for Leave-one-Out Cross-Validation, 1997, COLT.

[7] John Langford, et al. Beating the hold-out: bounds for K-fold and progressive cross-validation, 1999, COLT.

[8] André Elisseeff, et al. Stability and Generalization, 2002, J. Mach. Learn. Res.

[9] Tong Zhang, et al. Leave-One-Out Bounds for Kernel Methods, 2003, Neural Computation.

[10] T. Poggio, et al. General conditions for predictivity in learning theory, 2004, Nature.

[11] Claudio Gentile, et al. On the generalization ability of on-line learning algorithms, 2001, IEEE Trans. Inf. Theory.

[12] Massimiliano Pontil, et al. Stability of Randomized Learning Algorithms, 2005, J. Mach. Learn. Res.

[13] Sayan Mukherjee, et al. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, 2006, Adv. Comput. Math.

[14] Cynthia Dwork, et al. Calibrating Noise to Sensitivity in Private Data Analysis, 2006, TCC.

[15] T. Poggio, et al. Sufficient Conditions for Uniform Stability of Regularization Algorithms, 2009.

[16] Ohad Shamir, et al. Learnability, Stability and Uniform Convergence, 2010, J. Mach. Learn. Res.

[17] Sergei Vassilvitskii, et al. Cross-Validation and Mean-Square Stability, 2011, ICS.

[18] Anand D. Sarwate, et al. Differentially Private Empirical Risk Minimization, 2009, J. Mach. Learn. Res.

[19] Christopher Ré, et al. Toward a Noncommutative Arithmetic-geometric Mean Inequality: Conjectures, Case-studies, and Consequences, 2012, COLT.

[20] Marc Teboulle, et al. Smoothing and First Order Methods: A Unified Framework, 2012, SIAM J. Optim.

[21] Sergei Vassilvitskii, et al. Near-Optimal Bounds for Cross-Validation via Loss Stability, 2013, ICML.

[22] Sébastien Bubeck, et al. Convex Optimization: Algorithms and Complexity, 2014, Found. Trends Mach. Learn.

[23] Kfir Y. Levy, et al. Fast Rates for Exp-concave Empirical Risk Minimization, 2015, NIPS.

[24] Toniann Pitassi, et al. Preserving Statistical Validity in Adaptive Data Analysis, 2014, STOC.

[25] Vitaly Feldman, et al. Generalization of ERM in Stochastic Convex Optimization: The Dimension Strikes Back, 2016, NIPS.

[26] Yoram Singer, et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.

[27] Ohad Shamir, et al. Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization, 2016, arXiv.

[28] Raef Bassily, et al. Algorithmic stability for adaptive data analysis, 2015, STOC.

[29] Lorenzo Rosasco, et al. Optimal Learning for Multi-pass Stochastic Gradient Methods, 2016, NIPS.

[30] Alain Celisse, et al. Stability revisited: new generalisation bounds for the Leave-one-Out, 2016, arXiv:1608.06412.

[31] Thomas Steinke, et al. Subgaussian Tail Bounds via Stability Arguments, 2017, arXiv.

[32] Ben London, et al. A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent, 2017, NIPS.

[33] Dacheng Tao, et al. Algorithmic Stability and Hypothesis Complexity, 2017, ICML.

[34] Andreas Maurer. A Second-order Look at Stability and Generalization, 2017.

[35] Jeffrey F. Naughton, et al. Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics, 2016, SIGMOD Conference.

[36] Dimitris S. Papailiopoulos, et al. Stability and Generalization of Learning Algorithms that Converge to Global Optima, 2017, ICML.

[37] Bin Yu, et al. Stability and Convergence Trade-off of Iterative Optimization Algorithms, 2018, arXiv.

[38] Csaba Szepesvári, et al. An Exponential Tail Bound for Lq Stable Learning Rules. Application to k-Folds Cross-Validation, 2019, ISAIM.

[39] Christoph H. Lampert, et al. Data-Dependent Stability of Stochastic Gradient Descent, 2017, ICML.

[40] Jan Vondrák, et al. Generalization Bounds for Uniformly Stable Algorithms, 2018, NeurIPS.

[41] Shiliang Sun, et al. PAC-Bayes bounds for stable algorithms with instance-dependent priors, 2018, NeurIPS.

[42] Vitaly Feldman, et al. Privacy-preserving Prediction, 2018, COLT.

[43] Alessandro Rudi, et al. Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes, 2018, NeurIPS.

[44] Haipeng Luo, et al. Hypothesis Set Stability and Generalization, 2019, NeurIPS.

[45] Nicholas J. A. Harvey, et al. Tight Analyses for Non-Smooth Stochastic Gradient Descent, 2018, COLT.

[46] Kobbi Nissim, et al. Concentration Bounds for High Sensitivity Functions Through Differential Privacy, 2019, J. Priv. Confidentiality.

[47] Asuman E. Ozdaglar, et al. Why random reshuffling beats stochastic gradient descent, 2015, Mathematical Programming.