Improving the Gating Mechanism of Recurrent Neural Networks

Gating mechanisms are widely used in neural network models, where they allow gradients to backpropagate more easily through depth or time. However, their saturation property introduces problems of its own. For example, in recurrent models these gates need to have outputs near 1 to propagate information over long time-delays, which requires them to operate in their saturation regime and hinders gradient-based learning of the gate mechanism. We address this problem by deriving two synergistic modifications to the standard gating mechanism that are easy to implement, introduce no additional hyperparameters, and improve learnability of the gates when they are close to saturation. We show how these changes are related to and improve on alternative recently proposed gating mechanisms such as chrono initialization and Ordered Neurons. Empirically, our simple gating mechanisms robustly improve the performance of recurrent models on a range of applications, including synthetic memorization tasks, sequential image classification, language modeling, and reinforcement learning, particularly when long-term dependencies are involved.
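To make the saturation issue concrete, here is a minimal numerical sketch (not from the paper; the horizon and pre-activation values are illustrative) of a standard sigmoid forget gate. Keeping a memory alive over T steps means the gate output f must satisfy f^T ≈ 1, which pushes the pre-activation x deep into saturation, exactly where the gate's local gradient σ'(x) = σ(x)(1 − σ(x)) is vanishingly small.

# Illustrative sketch: why saturated sigmoid gates are hard to train.
# A forget gate f = sigmoid(x) retains roughly f**T of the cell state after
# T steps, so long horizons require f near 1 (x deep in saturation), where
# the gradient d f / d x = f * (1 - f) is nearly zero.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

T = 1000                        # horizon over which memory should survive (illustrative)
for x in [1.0, 3.0, 5.0, 7.0]:  # gate pre-activations, illustrative values
    f = sigmoid(x)
    retained = f ** T           # fraction of the cell state kept after T steps
    grad = f * (1.0 - f)        # gradient of the gate w.r.t. its pre-activation
    print(f"x={x:4.1f}  gate={f:.5f}  kept after {T} steps={retained:.3e}  d(gate)/dx={grad:.3e}")

Under these assumed numbers, retaining an appreciable fraction of the state over 1000 steps requires pre-activations around 7, where the gate's local gradient is on the order of 1e-3; this is the near-saturated regime that the proposed modifications aim to keep learnable.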
