On Convergence of Training Loss Without Reaching Stationary Points

Nonconvex optimization is known to be computationally intractable in the worst case. As a result, theoretical analyses of optimization algorithms such as gradient descent often focus on local convergence to stationary points, where the gradient norm is zero or negligible. In this work, we examine the disconnect between existing theoretical analyses of gradient-based algorithms and actual practice. Specifically, we provide numerical evidence that in large-scale neural network training, such as ImageNet + ResNet and WT103 + TransformerXL models, the neural network weights do not converge to stationary points where the gradient of the loss function vanishes. Remarkably, however, we observe that although the weights do not converge to stationary points, the value of the loss function does converge. Inspired by this observation, we propose a new perspective based on the ergodic theory of dynamical systems. We prove convergence of the distribution of weight values to an approximate invariant measure (without smoothness assumptions), which explains how the training loss can stabilize without the weights necessarily converging to stationary points. We further discuss how this perspective can better align the theory with empirical observations.
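The core observation can be illustrated on a toy problem. The sketch below is not one of the experiments described above; it is a hypothetical one-dimensional example, assuming constant-step subgradient descent on the nonsmooth function f(x) = |x|. The function name, starting point, and step size are illustrative choices. The iterates settle into a two-point cycle, so the loss value stabilizes while the subgradient norm stays at 1 and the iterate never reaches the stationary point x = 0.

```python
def subgradient_descent_abs(x0=0.75, step=0.5, num_steps=50):
    """Constant-step subgradient descent on f(x) = |x| (toy example).

    Returns the visited iterates, their loss values, and the
    subgradient norms at each iterate.
    """
    x = x0
    iterates, losses, grad_norms = [], [], []
    for _ in range(num_steps):
        # Subgradient of |x|: sign(x), with 0 chosen at x == 0.
        g = 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)
        iterates.append(x)
        losses.append(abs(x))
        grad_norms.append(abs(g))
        x -= step * g
    return iterates, losses, grad_norms


iterates, losses, grad_norms = subgradient_descent_abs()

# Inspect the last ten steps of the run.
tail_iterates = iterates[-10:]
tail_losses = losses[-10:]
tail_grad_norms = grad_norms[-10:]

# Spread of the loss over the tail: zero means the loss has stabilized.
loss_spread = max(tail_losses) - min(tail_losses)

print("tail iterates:", sorted(set(tail_iterates)))
print("loss spread over tail:", loss_spread)
print("min subgradient norm over tail:", min(tail_grad_norms))
```

In this toy run the empirical distribution of the iterates concentrates equally on the two points of the cycle, which gives a rough, informal picture of a loss that converges because the distribution of the weights stabilizes rather than because the weights themselves do.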
