On a continuous time model of gradient descent dynamics and instability in deep learning

The recipe behind the success of deep learning has been the combination of neural networks and gradient-based optimization. Understanding the behavior of gradient descent, however, and particularly its instability, has lagged behind its empirical success. To add to the theoretical tools available to study gradient descent, we propose the principal flow (PF), a continuous-time flow that approximates gradient descent dynamics. To our knowledge, the PF is the only continuous flow that captures the divergent and oscillatory behaviors of gradient descent, including escaping local minima and saddle points. Through its dependence on the eigendecomposition of the Hessian, the PF sheds light on the recently observed edge-of-stability phenomenon in deep learning. Using our new understanding of instability, we propose a learning rate adaptation method that enables us to control the trade-off between training stability and test-set performance.
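As a minimal illustration of the instability the abstract refers to (this is a standard quadratic-loss argument, not the paper's principal flow), the NumPy sketch below contrasts gradient descent with the negative gradient flow on a quadratic loss, where the dynamics decouple along the Hessian eigendirections: the discrete update scales each eigencoordinate by (1 - h*lambda) per step, so it oscillates and diverges once h*lambda > 2, while the flow always contracts. The learning rate and Hessian used here are arbitrary choices for illustration.

```python
import numpy as np

# Quadratic loss E(theta) = 0.5 * theta^T H theta. Along an eigendirection of H
# with eigenvalue lam, one gradient descent step multiplies the coordinate by
# (1 - h * lam): it oscillates when h * lam > 1 and diverges when h * lam > 2,
# whereas the negative gradient flow d(theta)/dt = -H theta decays monotonically.
h = 0.7                                   # learning rate (illustrative choice)
H = np.array([[3.0, 0.0],                 # Hessian with eigenvalues 3 and 0.5:
              [0.0, 0.5]])                # h * 3 = 2.1 > 2 -> unstable direction
theta0 = np.array([1.0, 1.0])             # shared initial point
theta_gd = theta0.copy()

lams, U = np.linalg.eigh(H)               # eigendecomposition of the Hessian

for t in range(1, 11):
    # Gradient descent step on the quadratic: theta <- theta - h * H theta.
    theta_gd = theta_gd - h * (H @ theta_gd)

    # Exact negative gradient flow solution at continuous time t * h:
    # theta(t) = U diag(exp(-lams * t * h)) U^T theta0, which always contracts.
    theta_ngf = U @ (np.exp(-lams * t * h) * (U.T @ theta0))

    print(f"step {t:2d}  GD: {theta_gd}  NGF: {theta_ngf}")
```

Running this shows the first coordinate of the gradient descent iterate flipping sign and growing in magnitude while the gradient flow solution shrinks, which is exactly the kind of divergent, oscillatory behavior that a continuous-time model must capture to describe training beyond the stability threshold.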
