Law of Balance and Stationary Distribution of Stochastic Gradient Descent

Stochastic gradient descent (SGD) is the standard algorithm for training neural networks, yet it remains poorly understood how SGD navigates the highly nonlinear and degenerate loss landscape of a neural network. In this work, we prove that the minibatch noise of SGD regularizes the solution towards a balanced one whenever the loss function contains a rescaling symmetry. Because the difference between a simple diffusion process and SGD dynamics is most significant when symmetries are present, our theory implies that loss-function symmetries constitute an essential probe of how SGD works. We then apply this result to derive the stationary distribution of stochastic gradient flow for a diagonal linear network of arbitrary depth and width. The stationary distribution exhibits complicated nonlinear phenomena such as phase transitions, broken ergodicity, and fluctuation inversion. These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.
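
As a concrete illustration of the law of balance stated above, consider the simplest model with a rescaling symmetry: a depth-2 scalar network f(x) = u*w*x trained with the squared loss. The loss is invariant under (u, w) -> (c*u, w/c), and gradient flow conserves the imbalance u^2 - w^2; the claim is that minibatch noise instead drives this quantity towards zero. Below is a minimal numerical sketch of this effect, not taken from the paper; the data-generating process, step size, and batch size are illustrative assumptions.

```python
# Minimal sketch (assumptions: scalar depth-2 model u*w*x, Gaussian data,
# step size and batch size chosen only for illustration).
# The loss (u*w*x - y)^2 is invariant under (u, w) -> (c*u, w/c); gradient
# flow conserves u^2 - w^2, while minibatch SGD shrinks this imbalance.
import numpy as np

rng = np.random.default_rng(0)
n, batch, lr, steps = 4096, 4, 0.02, 20000

x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)          # target slope 1.5 plus label noise

u, w = 3.0, 0.1                           # deliberately unbalanced initialization
for t in range(steps + 1):
    if t % 4000 == 0:
        print(f"step {t:6d}  u*w = {u * w:+.3f}  "
              f"imbalance u^2 - w^2 = {u * u - w * w:+.5f}")
    idx = rng.integers(0, n, size=batch)  # sample a minibatch
    r = u * w * x[idx] - y[idx]           # minibatch residuals
    g = 2.0 * np.mean(r * x[idx])         # shared factor: dL/du = g*w, dL/dw = g*u
    u, w = u - lr * g * w, w - lr * g * u # simultaneous SGD update
```

In this toy model each update satisfies the exact identity (u'^2 - w'^2) = (1 - lr^2 g^2)(u^2 - w^2), where g is the minibatch gradient factor shared by both parameters. Full-batch gradient descent therefore freezes the imbalance once its gradient vanishes, whereas minibatch noise keeps g^2 fluctuating above zero and continues to decay the imbalance, which is the balancing effect the abstract describes.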
