Paper Information - Revisiting NARX Recurrent Neural Networks for Long-Term Dependencies - 字舞流文

Revisiting NARX Recurrent Neural Networks for Long-Term Dependencies

Recurrent neural networks (RNNs) have shown success for many sequence-modeling tasks, but learning long-term dependencies from data remains difficult. This is often attributed to the vanishing gradient problem, which shows that gradient components relating a loss at time t to time t − τ tend to decay exponentially with τ. Long short-term memory (LSTM) and gated recurrent units (GRUs), the most widely used RNN architectures, attempt to remedy this problem by making the decay's base closer to 1. NARX RNNs take an orthogonal approach: by including direct connections, or delays, from the past, NARX RNNs make the decay's exponent closer to 0. However, as introduced, NARX RNNs reduce the decay's exponent only by a factor of n_d, the number of delays, and simultaneously increase computation by this same factor. We introduce a new variant of NARX RNNs, called MIxed hiSTory RNNs (MIST RNNs), which addresses these drawbacks. We show that for τ ≤ 2^(n_d − 1), MIST RNNs reduce the decay's worst-case exponent from τ/n_d to log τ, while maintaining computational complexity that is similar to LSTM and GRUs. We compare MIST RNNs to simple RNNs, LSTM, and GRUs across 4 diverse tasks. MIST RNNs outperform all other methods in 2 cases, and in all cases are competitive.
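The exponents quoted in the abstract can be illustrated by counting the fewest recurrent hops needed to connect time t to time t − τ. The sketch below is only an illustration, not the authors' code. It assumes that a NARX RNN uses the contiguous delays 1, ..., n_d and that the mixed-history variant uses the power-of-two delays 2^0, ..., 2^(n_d − 1); the abstract does not state these delay sets, though the bound τ ≤ 2^(n_d − 1) is consistent with the latter. The function names are hypothetical.

```python
def hops_simple(tau):
    """Simple RNN: only a delay of 1 exists, so tau hops are needed."""
    return tau


def hops_narx(tau, n_d):
    """Assumed delays 1..n_d: the fewest hops is ceil(tau / n_d)."""
    return -(-tau // n_d)  # integer ceiling division


def hops_power_of_two(tau, n_d):
    """Assumed delays 2^0..2^(n_d - 1): fewest hops via largest-delay-first.

    Greedy selection is optimal here because each delay divides the next.
    """
    hops, remaining = 0, tau
    for j in reversed(range(n_d)):
        delay = 2 ** j
        hops += remaining // delay
        remaining %= delay
    return hops


if __name__ == "__main__":
    n_d = 8
    for tau in (4, 16, 100, 127, 128):
        print(tau, hops_simple(tau), hops_narx(tau, n_d),
              hops_power_of_two(tau, n_d))
```

Under these assumptions, with n_d = 8 a dependency at τ = 100 is reached in 100 hops by a simple RNN, 13 hops with contiguous delays, and 3 hops with power-of-two delays. The hardest case below 2^(n_d − 1) = 128 is τ = 127, which needs 7 hops, in line with the log τ scaling described above.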

Nassir Navab | Gregory D. Hager | Robert S. DiPietro
