Implicit Gradient Regularization

Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. More broadly, our work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent.
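To make the explicit variant concrete, below is a minimal sketch of gradient-norm regularization in JAX, assuming a toy linear least-squares model; the penalty coefficient mu, the synthetic data, and the helper names are illustrative choices and not the paper's experimental setup. The sketch simply adds a squared-gradient-norm term to the training loss, which is the mechanism the abstract describes for controlling the regularization directly.

```python
# Minimal sketch of explicit gradient-norm regularization (assumptions:
# toy linear least-squares model, illustrative coefficient `mu`).
import jax
import jax.numpy as jnp

def loss(params, x, y):
    # Plain squared-error loss for a linear model y ~ x @ params.
    pred = x @ params
    return jnp.mean((pred - y) ** 2)

def regularized_loss(params, x, y, mu=0.01):
    # Penalize the squared norm of the loss gradient, biasing gradient
    # descent toward flatter regions of the loss surface.
    grads = jax.grad(loss)(params, x, y)
    return loss(params, x, y) + mu * jnp.sum(grads ** 2)

@jax.jit
def train_step(params, x, y, lr=0.1):
    # One gradient-descent step on the explicitly regularized objective.
    g = jax.grad(regularized_loss)(params, x, y)
    return params - lr * g

# Usage on synthetic data.
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 4))
true_w = jnp.array([1.0, -2.0, 0.5, 3.0])
y = x @ true_w
params = jnp.zeros(4)
for _ in range(200):
    params = train_step(params, x, y)
```

Because the penalty is itself built from a gradient, each training step differentiates through a gradient computation (a Hessian-vector product), which JAX handles via the nested jax.grad calls above.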
