How does momentum benefit deep neural network architecture design? A few case studies

We present and review an algorithmic and theoretical framework for improving neural network architecture design via momentum. As case studies, we consider how momentum can improve the architecture design of recurrent neural networks (RNNs), neural ordinary differential equations (ODEs), and transformers. We show that integrating momentum into neural network architectures yields several remarkable theoretical and empirical benefits: 1) integrating momentum into RNNs and neural ODEs can overcome the vanishing gradient issue in their training, enabling effective learning of long-term dependencies; 2) momentum in neural ODEs can reduce the stiffness of the ODE dynamics, significantly enhancing computational efficiency in training and testing; and 3) momentum can improve the efficiency and accuracy of transformers.
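
To make the RNN case study concrete, below is a minimal NumPy sketch of one way heavy-ball momentum can be folded into a vanilla RNN cell, in the spirit of the MomentumRNN idea: the input drive is accumulated in an auxiliary velocity state before it enters the hidden-state update. The function name momentum_rnn_step, the hyperparameters mu and s, and the exact placement of the momentum term are illustrative assumptions for this sketch, not a definitive statement of the formulation used in the paper.

```python
import numpy as np

def momentum_rnn_step(x_t, h_prev, v_prev, U, W, b, mu=0.9, s=1.0):
    """One step of a momentum-augmented RNN cell (illustrative sketch).

    A vanilla RNN computes h_t = tanh(U @ x_t + W @ h_prev + b).
    Here the input drive U @ x_t is accumulated in a velocity state v_t,
    mirroring the heavy-ball update v_t = mu * v_{t-1} + s * (U @ x_t).
    mu (momentum) and s (step size) are assumed hyperparameter names.
    """
    v_t = mu * v_prev + s * (U @ x_t)      # heavy-ball velocity update
    h_t = np.tanh(W @ h_prev + v_t + b)    # hidden state driven by the velocity
    return h_t, v_t

# Tiny usage example on random data.
rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 10
U = rng.normal(scale=0.1, size=(d_h, d_in))
W = rng.normal(scale=0.1, size=(d_h, d_h))
b = np.zeros(d_h)
h, v = np.zeros(d_h), np.zeros(d_h)
for t in range(T):
    x_t = rng.normal(size=d_in)
    h, v = momentum_rnn_step(x_t, h, v, U, W, b)
```

Under this reading, setting mu = 0 recovers the standard RNN recurrence, so the momentum cell is a strict generalization; the same velocity-state construction is what, in the continuous-time limit, turns a first-order neural ODE into a second-order (heavy-ball) one.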
