A Theory on Adam Instability in Large-Scale Machine Learning

We present a theory for the previously unexplained divergent behavior observed in the training of large language models. We argue that the phenomenon is an artifact of the dominant optimization algorithm used for training, called Adam. We observe that Adam can enter a state in which the parameter update vector has a relatively large norm and is essentially uncorrelated with the direction of descent on the training loss landscape, leading to divergence. This artifact is more likely to arise when training a deep model with a large batch size, which is the typical setting of large-scale language model training. To support the theory, we present observations from training runs of language models at different scales: 7 billion, 30 billion, 65 billion, and 546 billion parameters.

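To make the mechanism concrete, below is a minimal sketch (not from the paper; the hyperparameter values and the `update_gradient_alignment` helper are illustrative assumptions) of the standard Adam update together with a cosine-similarity diagnostic between the resulting update vector and the negative gradient. A value near zero corresponds to the state the abstract describes, in which the update is essentially uncorrelated with the direction of descent.

```python
import numpy as np

def adam_step(grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.95, eps=1e-8):
    """One Adam update (Kingma & Ba, 2014); returns the update vector and new moment state."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = -lr * m_hat / (np.sqrt(v_hat) + eps)
    return update, m, v

def update_gradient_alignment(update, grad):
    """Cosine similarity between the Adam update and the negative gradient.

    Illustrative diagnostic (an assumption, not the paper's code): values near
    zero indicate an update essentially uncorrelated with the local direction
    of steepest descent, despite the update possibly having a large norm.
    """
    descent = -grad
    denom = np.linalg.norm(update) * np.linalg.norm(descent) + 1e-12
    return float(update @ descent / denom)
```

In practice, one would log this alignment (and the update norm) per parameter block over training steps; under the theory sketched in the abstract, a sustained combination of large update norm and near-zero alignment would precede divergence.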