Why are Adaptive Methods Good for Attention Models?
Sashank J. Reddi, Sai Praneeth Karimireddy, S. Sra, Sanjiv Kumar, Andreas Veit, Seungyeon Kim, J. Zhang
[1] Eduard A. Gorbunov, et al. Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping, 2020, NeurIPS.
[2] Jacob Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[3] Yee Whye Teh, et al. Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise, 2020, ICML.
[4] Ashok Cutkosky, et al. Momentum Improves Normalized SGD, 2020, ICML.
[5] John C. Duchi, et al. Lower bounds for non-convex stochastic optimization, 2019, Mathematical Programming.
[6] Praneeth Netrapalli, et al. Non-Gaussianity of Stochastic Gradient Noise, 2019, ArXiv.
[7] Denis Yarats, et al. On the adequacy of untuned warmup for adaptive optimization, 2019, AAAI.
[8] Jianfeng Gao, et al. On the Variance of the Adaptive Learning Rate and Beyond, 2019, ICLR.
[9] Gaël Richard, et al. First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise, 2019, NeurIPS.
[10] Suvrit Sra, et al. Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity, 2019, ICLR.
[11] Sanjiv Kumar, et al. Escaping Saddle Points with Adaptive Gradient Methods, 2019, ICML.
[12] Levent Sagun, et al. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks, 2019, ICML.
[13] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[14] Li Shen, et al. A Sufficient Condition for Convergences of Adam and RMSProp, 2018, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019.
[15] Yong Yu, et al. AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods, 2018, ICLR.
[16] Yuan Cao, et al. On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, 2018, ArXiv.
[17] Li Shen, et al. On the Convergence of Weighted AdaGrad with Momentum for Training Deep Neural Networks, 2018.
[18] Ruoyu Sun, et al. On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, 2018, ICLR.
[19] Yi Zhang, et al. The Case for Full-Matrix Adaptive Regularization, 2018, ArXiv.
[20] Xiaoxia Wu, et al. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization, 2019.
[21] Francesco Orabona, et al. On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes, 2018, AISTATS.
[22] Bin Dong, et al. Nostalgic Adam: Weighting More of the Past Gradients When Designing the Adaptive Learning Rate, 2018, IJCAI.
[23] Sashank J. Reddi, et al. On the Convergence of Adam and Beyond, 2018, ICLR.
[24] Yair Carmon, et al. Lower bounds for finding stationary points I, 2017, Mathematical Programming.
[25] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[26] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.
[27] Kfir Y. Levy, et al. The Power of Normalization: Faster Evasion of Saddle Points, 2016, ArXiv.
[28] Shai Shalev-Shwartz, et al. Beyond Convexity: Stochastic Quasi-Convex Optimization, 2015, NIPS.
[29] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[30] Nicolò Cesa-Bianchi, et al. Bandits With Heavy Tail, 2012, IEEE Transactions on Information Theory.
[31] Ohad Shamir, et al. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization, 2011, ICML.
[32] John C. Duchi, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011.
[33] H. Robbins. A Stochastic Approximation Method, 1951.
[34] L. Armijo. Minimization of functions having Lipschitz continuous first partial derivatives, 1966.