Revisiting NARX Recurrent Neural Networks for Long-Term Dependencies

Recurrent neural networks (RNNs) have shown success on many sequence-modeling tasks, but learning long-term dependencies from data remains difficult. This is often attributed to the vanishing gradient problem, which shows that gradient components relating a loss at time t to time t − τ tend to decay exponentially with τ. Long short-term memory (LSTM) and gated recurrent units (GRUs), the most widely used RNN architectures, attempt to remedy this problem by making the decay’s base closer to 1. NARX RNNs (nonlinear autoregressive models with exogenous inputs) take an orthogonal approach: by including direct connections, or delays, from the past, NARX RNNs make the decay’s exponent closer to 0. However, as originally introduced, NARX RNNs reduce the decay’s exponent only by a factor of n_d, the number of delays, while increasing computation by this same factor. We introduce a new variant of NARX RNNs, called MIxed hiSTory RNNs (MIST RNNs), which addresses these drawbacks. We show that for τ ≤ 2^(n_d − 1), MIST RNNs reduce the decay’s worst-case exponent from τ/n_d to log τ, while maintaining computational complexity similar to that of LSTM and GRUs. We compare MIST RNNs to simple RNNs, LSTM, and GRUs across 4 diverse tasks. MIST RNNs outperform all other methods in 2 cases and are competitive in all cases.
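To make the idea of delay connections concrete, below is a minimal NumPy sketch, not the authors’ implementation, of a simple RNN whose update draws on hidden states at exponentially spaced delays {1, 2, 4, …, 2^(n_d − 1)}, so that a state τ steps in the past is reachable through roughly log τ connections rather than τ consecutive ones. The function name `delay_rnn_forward` and the plain averaging of delayed states are illustrative placeholders; the actual MIST RNN cell combines the delayed states with learned weights and gating, which this sketch omits.

```python
# Illustrative sketch only: a simple RNN whose update mixes hidden states
# from exponentially spaced delays, so any past step is reachable through
# O(log tau) connections. Not the authors' exact MIST RNN cell.
import numpy as np

def delay_rnn_forward(x_seq, W_x, W_h, b, n_d):
    """x_seq: (T, d_in); W_x: (d_h, d_in); W_h: (d_h, d_h); b: (d_h,)."""
    T = x_seq.shape[0]
    d_h = b.shape[0]
    delays = [2 ** i for i in range(n_d)]      # 1, 2, 4, ..., 2^(n_d - 1)
    h = [np.zeros(d_h)]                        # h[0] is the initial state
    for t in range(1, T + 1):
        # Average the available delayed states (a crude placeholder for the
        # learned combination used in MIST RNNs).
        past = [h[t - d] for d in delays if t - d >= 0]
        h_mix = np.mean(past, axis=0) if past else np.zeros(d_h)
        h_t = np.tanh(W_x @ x_seq[t - 1] + W_h @ h_mix + b)
        h.append(h_t)
    return np.stack(h[1:])                     # (T, d_h)

# Usage: with n_d = 8 delays, a state ~100 steps back is a few hops away.
rng = np.random.default_rng(0)
d_in, d_h, T, n_d = 4, 16, 128, 8
out = delay_rnn_forward(rng.normal(size=(T, d_in)),
                        rng.normal(size=(d_h, d_in)) * 0.1,
                        rng.normal(size=(d_h, d_h)) * 0.1,
                        np.zeros(d_h), n_d)
print(out.shape)  # (128, 16)
```

Note that the extra cost per step is only the gather and mix of n_d delayed states, which is why the complexity can stay comparable to LSTM and GRUs when n_d grows only logarithmically with the horizon of interest.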
