LaProp: a Better Way to Combine Momentum with Adaptive Gradient

Identifying a divergence problem in Adam, we propose a new optimizer, LaProp, which belongs to the family of adaptive gradient descent methods. The method allows greater flexibility in choosing its hyperparameters, reduces the effort of fine-tuning, and permits straightforward interpolation between signed gradient methods and adaptive gradient methods. We bound the regret of LaProp on a convex problem and show that our bound differs from those of previous methods by a key factor, which demonstrates its advantage. We show experimentally that LaProp outperforms previous methods on a toy task with noisy gradients, on the optimization of extremely deep fully connected networks, on neural style transfer, on natural language processing with transformers, and on reinforcement learning with deep Q-networks. The performance improvement of LaProp is consistent, and sometimes dramatic and qualitative.
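The abstract does not spell out the update rule, so the sketch below is only an assumed illustration of the idea of combining momentum with adaptive gradient normalization: the gradient is divided by a running second-moment estimate first, and momentum is then accumulated on that normalized gradient (in contrast to Adam, which accumulates momentum on the raw gradient and normalizes afterwards). The function name, hyperparameter defaults, and bias-correction details are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def laprop_update(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-15):
    """One LaProp-style step (assumed form): normalize the gradient by the running
    second-moment estimate, then accumulate momentum on the normalized gradient."""
    state["step"] += 1
    t = state["step"]

    # Running estimate of the squared gradient, as in RMSprop/Adam.
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad**2
    v_hat = state["v"] / (1.0 - beta2**t)      # bias correction (assumed form)

    # Divide the gradient by its adaptive scale *before* the momentum buffer.
    normalized_grad = grad / (np.sqrt(v_hat) + eps)
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * normalized_grad
    m_hat = state["m"] / (1.0 - beta1**t)      # bias correction (assumed form)

    return param - lr * m_hat

# Usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([1.0, -2.0])
state = {"m": np.zeros_like(x), "v": np.zeros_like(x), "step": 0}
for _ in range(1000):
    x = laprop_update(x, x, state, lr=0.01)
print(x)  # ends up near the origin, to within roughly the step size
```

Under this reading, setting beta2 to 0 makes the denominator equal to |grad|, so each coordinate update becomes (a momentum average of) the sign of the gradient, which is one way to realize the interpolation between signed and adaptive gradient methods that the abstract mentions.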
