A Theory on Adam Instability in Large-Scale Machine Learning
Punit Singh Koura, Naman Goyal, Zach DeVito, Sharan Narang, Susan Zhang, Moya Chen, Stephen Roller, Yuchen Zhang, Igor Molybog, Binh Tang, Andrew Poulton, Melanie Kambadur, Puxin Xu, Diana Liskovich, Peter Albert, David Esiobu, Ruan Silva