Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent

In the vanishing learning rate regime, stochastic gradient descent (SGD) is now relatively well understood. In this work, we study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and discussing their implications. The main contribution of this work is to derive the stationary distribution of discrete-time SGD on a quadratic loss, with and without momentum; in particular, one implication of our result is that the fluctuation caused by discrete-time dynamics takes a distorted shape and is dramatically larger than a continuous-time theory would predict. Applications of the proposed theory considered in this work include the approximation error of SGD variants, the effect of minibatch noise, optimal Bayesian inference, the escape rate from a sharp minimum, and the stationary covariance of several second-order methods, including damped Newton's method, natural gradient descent, and Adam.
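
As a quick numerical illustration of the gap between discrete-time and continuous-time predictions, the sketch below simulates one-dimensional SGD on the quadratic loss L(θ) = aθ²/2 with additive Gaussian gradient noise. The parameter values (a, C, η) and the closed-form discrete-time variance ηC / (a(2 − ηa)), obtained from the standard linear recursion Var_{t+1} = (1 − ηa)² Var_t + η²C, are illustrative assumptions for this sketch and are not taken from the paper's general derivation.

```python
# Minimal sketch: stationary fluctuation of 1D discrete-time SGD on a quadratic loss,
# compared against the continuous-time (vanishing learning rate) prediction.
import numpy as np

rng = np.random.default_rng(0)
a, C, eta = 1.0, 1.0, 1.5        # curvature, gradient-noise variance, learning rate (illustrative)
steps, burn_in = 200_000, 10_000  # total iterations and discarded transient

theta, samples = 0.0, []
for t in range(steps):
    # noisy gradient of L(theta) = a * theta**2 / 2 with additive Gaussian noise
    grad = a * theta + np.sqrt(C) * rng.standard_normal()
    theta -= eta * grad
    if t >= burn_in:
        samples.append(theta)

print("empirical stationary variance :", np.var(samples))
print("continuous-time prediction    :", eta * C / (2 * a))
print("discrete-time prediction      :", eta * C / (a * (2 - eta * a)))
```

Running this with a learning rate close to the stability boundary ηa = 2 (here ηa = 1.5) shows the empirical variance agreeing with the discrete-time formula while exceeding the continuous-time estimate ηC/(2a) several-fold, consistent with the qualitative claim in the abstract.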
