Learning to Forget: Continual Prediction with LSTM

Long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) can solve numerous tasks not solvable by previous learning algorithms for recurrent neural networks (RNNs). We identify a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel, adaptive forget gate that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources. We review illustrative benchmark problems on which standard LSTM outperforms other RNN algorithms. All algorithms (including LSTM) fail to solve continual versions of these problems. LSTM with forget gates, however, easily solves them, and in an elegant way.
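For concreteness, the cell-state update that the forget gate modifies can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the weight names (W_in, W_phi, W_out, W_c) and the tanh squashing functions are assumptions of the sketch; only the update rule s_c(t) = y_phi(t) * s_c(t-1) + y_in(t) * g(net_c(t)) reflects the forget-gate mechanism the abstract describes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, s_prev, W_in, W_phi, W_out, W_c):
    """One forward step of a single LSTM memory cell with a forget gate.

    x      -- concatenated external and recurrent inputs at time t
    s_prev -- previous cell state s_c(t-1)
    Weight matrices are illustrative placeholders, not the paper's notation.
    """
    y_in  = sigmoid(W_in @ x)    # input gate
    y_phi = sigmoid(W_phi @ x)   # forget gate: learns when to reset the state
    y_out = sigmoid(W_out @ x)   # output gate
    g = np.tanh(W_c @ x)         # squashed cell input
    # With y_phi fixed at 1 this reduces to standard LSTM, whose state can
    # grow without bound on continual (never-reset) input streams; letting
    # the network learn y_phi is the remedy proposed here.
    s = y_phi * s_prev + y_in * g
    y = y_out * np.tanh(s)       # gated cell output
    return y, s
```

A forget-gate activation near 0 discards the old state (a learned reset releasing the cell's resources); an activation near 1 preserves it, recovering standard LSTM behavior.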

[1] Paul J. Werbos, et al. Generalization of backpropagation with application to a recurrent gas market model, 1988, Neural Networks.

[2] Geoffrey E. Hinton. Learning distributed representations of concepts, 1989.

[3] Jürgen Schmidhuber, et al. A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks, 1989.

[4] Michael C. Mozer, et al. A Focused Backpropagation Algorithm for Temporal Pattern Recognition, 1989, Complex Syst.

[5] David Zipser, et al. Learning Sequential Structure with the Real-Time Recurrent Learning Algorithm, 1991, Int. J. Neural Syst.

[6] James L. McClelland, et al. Finite State Automata and Simple Recurrent Networks, 1989, Neural Computation.

[7] Alexander H. Waibel, et al. Modular Construction of Time-Delay Neural Networks for Speech Recognition, 1989, Neural Computation.

[8] Kenji Doya, et al. Adaptive neural oscillator using continuous-time back-propagation learning, 1989, Neural Networks.

[9] Jing Peng, et al. An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories, 1990, Neural Computation.

[10] Michael C. Mozer, et al. Connectionist Music Composition Based on Melodic and Stylistic Constraints, 1990, NIPS.

[11] Jeffrey L. Elman, et al. Finding Structure in Time, 1990, Cogn. Sci.

[12] Scott E. Fahlman, et al. The Recurrent Cascade-Correlation Architecture, 1990, NIPS.

[13] Michael I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine, 1990.

[14] David E. Rumelhart, et al. Generalization by Weight-Elimination with Application to Forecasting, 1990, NIPS.

[15] Alexander H. Waibel, et al. Multi-State Time Delay Networks for Continuous Speech Recognition, 1991, NIPS.

[16] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen [Investigations of dynamic neural networks], 1991.

[17] Jürgen Schmidhuber, et al. A Fixed Size Storage O(n³) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks, 1992, Neural Computation.

[18] Ah Chung Tsoi, et al. Locally recurrent globally feedforward networks: a critical review of architectures, 1994, IEEE Trans. Neural Networks.

[19] Andreas S. Weigend, et al. Time Series Prediction: Forecasting the Future and Understanding the Past, 1994.

[20] Yoshua Bengio, et al. Learning long-term dependencies with gradient descent is difficult, 1994, IEEE Trans. Neural Networks.

[21] Eric Mjolsness, et al. A Multiscale Attentional Framework for Relaxation Neural Networks, 1995, NIPS.

[22] Ronald J. Williams, et al. Gradient-based learning algorithms for recurrent networks and their computational complexity, 1995.

[23] Barak A. Pearlmutter. Gradient calculations for dynamic recurrent neural networks: a survey, 1995, IEEE Trans. Neural Networks.

[24] Peter Tiño, et al. Learning long-term dependencies in NARX recurrent neural networks, 1996, IEEE Trans. Neural Networks.

[25] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[26] Christian J. Darken, et al. Stochastic approximation and neural network learning, 1998.

[27] Fred Cummins, et al. Automatic discrimination among languages based on prosody alone, 1999.

[28] Jürgen Schmidhuber, et al. Language identification from prosody without explicit features, 1999, EUROSPEECH.

[29] Jonathan D. Cohen, et al. A Biologically Based Computational Model of Working Memory, 1999.

[30] Jürgen Schmidhuber, et al. Learning to forget: continual prediction with LSTM, 1999.

[31] Fred Cummins, et al. Learning to Forget: Continual Prediction with LSTM, 1999.

[32] Gavin C. Cawley, et al. On a Fast, Compact Approximation of the Exponential Function, 2000, Neural Computation.

[33] Jürgen Schmidhuber, et al. LSTM recurrent networks learn simple context-free and context-sensitive languages, 2001, IEEE Trans. Neural Networks.