How does momentum benefit deep neural network architecture design? A few case studies

We present and review an algorithmic and theoretical framework for improving neural network architecture design via momentum. As case studies, we consider how momentum can improve the architecture design of recurrent neural networks (RNNs), neural ordinary differential equations (ODEs), and transformers. We show that integrating momentum into neural network architectures yields several remarkable theoretical and empirical benefits: 1) integrating momentum into RNNs and neural ODEs can overcome the vanishing gradient issue in their training, enabling effective learning of long-term dependencies; 2) momentum in neural ODEs can reduce the stiffness of the ODE dynamics, significantly enhancing computational efficiency in training and testing; and 3) momentum can improve the efficiency and accuracy of transformers.
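
To make the RNN case study concrete, below is a minimal NumPy sketch of one way heavy-ball momentum can be folded into a vanilla RNN cell, in the spirit of the MomentumRNN idea: the input drive is accumulated in an auxiliary velocity state before it enters the hidden-state update. The function name momentum_rnn_step, the hyperparameters mu and s, and the exact placement of the momentum term are illustrative assumptions for this sketch, not a definitive statement of the formulation used in the paper.

```python
import numpy as np

def momentum_rnn_step(x_t, h_prev, v_prev, U, W, b, mu=0.9, s=1.0):
    """One step of a momentum-augmented RNN cell (illustrative sketch).

    A vanilla RNN computes h_t = tanh(U @ x_t + W @ h_prev + b).
    Here the input drive U @ x_t is accumulated in a velocity state v_t,
    mirroring the heavy-ball update v_t = mu * v_{t-1} + s * (U @ x_t).
    mu (momentum) and s (step size) are assumed hyperparameter names.
    """
    v_t = mu * v_prev + s * (U @ x_t)      # heavy-ball velocity update
    h_t = np.tanh(W @ h_prev + v_t + b)    # hidden state driven by the velocity
    return h_t, v_t

# Tiny usage example on random data.
rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 10
U = rng.normal(scale=0.1, size=(d_h, d_in))
W = rng.normal(scale=0.1, size=(d_h, d_h))
b = np.zeros(d_h)
h, v = np.zeros(d_h), np.zeros(d_h)
for t in range(T):
    x_t = rng.normal(size=d_in)
    h, v = momentum_rnn_step(x_t, h, v, U, W, b)
```

Under this reading, setting mu = 0 recovers the standard RNN recurrence, so the momentum cell is a strict generalization; the same velocity-state construction is what, in the continuous-time limit, turns a first-order neural ODE into a second-order (heavy-ball) one.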
