Stability of SGD: Tightness Analysis and Improved Bounds

Stochastic Gradient Descent (SGD)-based methods have been widely used for training large-scale machine learning models that also generalize well in practice. Several explanations have been offered for this generalization performance, a prominent one being algorithmic stability [4]. However, there are no known examples of smooth loss functions for which the analysis can be shown to be tight. Furthermore, apart from properties of the loss function, the data distribution has also been shown to be an important factor in generalization performance. This raises the question: is the stability analysis of [4] tight for smooth functions, and if not, for what kinds of loss functions and data distributions can the stability analysis be improved? In this paper, we first settle open questions regarding the tightness of bounds in the data-independent setting: we show that for general datasets, the existing analysis for convex and strongly convex loss functions is tight, but it can be improved for non-convex loss functions. Next, we give novel and improved data-dependent bounds: we show stability upper bounds for a large class of regularized convex loss functions with negligible regularization parameters, and we improve existing data-dependent bounds in the non-convex setting. We hope that our results will initiate further efforts to better understand the data-dependent setting under non-convex loss functions, leading to an improved understanding of the generalization abilities of deep networks.
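To make the notion of algorithmic stability concrete, the following is a minimal numerical sketch (not from the paper) of the neighboring-datasets experiment that underlies uniform stability of SGD as analyzed in [4]: SGD is run with a shared sample order on two datasets that differ in a single example, and the divergence of the two parameter trajectories is tracked. The logistic loss, step size, and synthetic dataset below are illustrative assumptions, not choices made in the paper.

import numpy as np

rng = np.random.default_rng(0)

def logistic_loss_grad(w, x, y, lam=0.0):
    # Gradient of the (optionally L2-regularized) logistic loss at a single example.
    z = y * (x @ w)
    return -y * x / (1.0 + np.exp(z)) + lam * w

def sgd_trajectory(X, y, order, lr=0.1, lam=0.0):
    # One pass of SGD over the examples in the given order; returns all iterates.
    w = np.zeros(X.shape[1])
    iterates = [w.copy()]
    for i in order:
        w = w - lr * logistic_loss_grad(w, X[i], y[i], lam)
        iterates.append(w.copy())
    return np.array(iterates)

# Dataset S and a neighboring dataset S' that differs only in example 0.
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=n))
X_prime, y_prime = X.copy(), y.copy()
X_prime[0] = rng.normal(size=d)
y_prime[0] = -y[0]

# Shared sample order (same randomness) for both runs, mirroring the coupling
# argument used in stability analyses of SGD.
order = rng.permutation(n)
W = sgd_trajectory(X, y, order)
W_prime = sgd_trajectory(X_prime, y_prime, order)

# Parameter divergence ||w_t - w'_t|| along the trajectory; for Lipschitz losses
# this quantity controls the uniform-stability parameter.
divergence = np.linalg.norm(W - W_prime, axis=1)
print("final parameter divergence:", divergence[-1])

For smooth convex losses and suitably small step sizes, the analysis of [4] bounds exactly this kind of divergence; the tightness questions settled in the paper concern such bounds.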

[1] Matus Telgarsky et al. Spectrally-normalized margin bounds for neural networks, 2017, NIPS.

[2] Stefano Soatto et al. Entropy-SGD: biasing gradient descent into wide valleys, 2016, ICLR.

[3] Yair Carmon et al. Lower bounds for finding stationary points II: first-order methods, 2017, Mathematical Programming.

[4] Yoram Singer et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.

[5] T. Poggio et al. Stability results in learning theory, 2005.

[6] Yi Zhang et al. Stronger generalization bounds for deep nets via a compression approach, 2018, ICML.

[7] Kunal Talwar et al. Private stochastic convex optimization: optimal rates in linear time, 2020, STOC.

[8] Luc Devroye et al. Distribution-free performance bounds with the resubstitution error estimate, 1979, IEEE Trans. Inf. Theory.

[9] Ohad Shamir et al. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization, 2011, ICML.

[10] Bin Yu et al. Stability and Convergence Trade-off of Iterative Optimization Algorithms, 2018, arXiv.

[11] Nitish Srivastava et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[12] Kai Zheng et al. Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints, 2017, COLT.

[13] Christoph H. Lampert et al. Data-Dependent Stability of Stochastic Gradient Descent, 2017, ICML.

[14] Frederick R. Forst et al. On robust estimation of the location parameter, 1980.

[15] Tong Zhang et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction, 2013, NIPS.

[16] Ohad Shamir et al. Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes, 2012, ICML.

[17] Jian Li et al. On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning, 2019, ICLR.

[18] Anders Krogh et al. A Simple Weight Decay Can Improve Generalization, 1991, NIPS.

[19] M. Kearns et al. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation, 1999.

[20] Ohad Shamir et al. Learnability, Stability and Uniform Convergence, 2010, J. Mach. Learn. Res.

[21] W. Rogers et al. A Finite Sample Distribution-Free Performance Bound for Local Discrimination Rules, 1978.

[22] Luc Devroye et al. Distribution-free inequalities for the deleted and holdout error estimates, 1979, IEEE Trans. Inf. Theory.

[23] J. Zico Kolter et al. Uniform convergence may be unable to explain generalization in deep learning, 2019, NeurIPS.

[24] Jimmy Ba et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[25] Léon Bottou. Stochastic Gradient Descent Tricks, 2012, Neural Networks: Tricks of the Trade.

[26] Vitaly Feldman et al. Privacy Amplification by Iteration, 2018, FOCS.

[27] V. Koltchinskii et al. Rademacher Processes and Bounding the Risk of Function Learning, 2004, arXiv:math/0405338.

[28] Nathan Srebro et al. Exploring Generalization in Deep Learning, 2017, NIPS.

[29] Yair Carmon et al. Lower bounds for finding stationary points I, 2017, Mathematical Programming.

[30] Vladimir Vapnik. Statistical learning theory, 1998.

[31] Jeffrey F. Naughton et al. Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics, 2016, SIGMOD.

[32] Massimiliano Pontil et al. Stability of Randomized Learning Algorithms, 2005, J. Mach. Learn. Res.

[33] Raef Bassily et al. Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses, 2020, NeurIPS.

[34] Cynthia Dwork et al. Calibrating Noise to Sensitivity in Private Data Analysis, 2006, TCC.

[35] Furong Huang et al. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition, 2015, COLT.

[36] André Elisseeff et al. Stability and Generalization, 2002, J. Mach. Learn. Res.

[37] Jan Vondrák et al. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate, 2019, COLT.