Learning long-term dependencies with gradient descent is difficult

Recurrent neural networks can be used to map input sequences to output sequences, such as for recognition, production, or prediction problems. However, practical difficulties have been reported in training recurrent neural networks to perform tasks in which the temporal contingencies present in the input/output sequences span long intervals. We show why gradient-based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases. These results expose a trade-off between efficient learning by gradient descent and latching onto information for long periods. Based on an understanding of this problem, alternatives to standard gradient descent are considered.
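The difficulty the abstract refers to can be illustrated numerically: in a recurrent network, the gradient of the state at time T with respect to the state at time 0 is a product of T per-step Jacobians, and when the recurrent dynamics are contractive this product typically shrinks exponentially with T. The sketch below is not from the paper; it assumes a simple tanh recurrence h_t = tanh(W h_{t-1}), and the hidden size, weight scale, and sequence lengths are arbitrary choices for illustration only.

```python
# Minimal numerical sketch (illustrative assumption, not the paper's experiment):
# measure how the norm of d h_T / d h_0 decays with T for a tanh recurrence.
import numpy as np

rng = np.random.default_rng(0)
n = 20                                                # hidden units (arbitrary)
W = rng.normal(scale=0.5 / np.sqrt(n), size=(n, n))   # contractive recurrent weights

def jacobian_product_norm(T):
    """Spectral norm of d h_T / d h_0 for h_t = tanh(W @ h_{t-1})."""
    h = rng.normal(size=n)   # arbitrary initial state
    J = np.eye(n)
    for _ in range(T):
        h = np.tanh(W @ h)
        # One-step Jacobian: diag(1 - tanh(.)^2) @ W, accumulated into the product
        J = np.diag(1.0 - h**2) @ W @ J
    return np.linalg.norm(J, 2)

for T in (5, 10, 20, 40, 80):
    print(f"T = {T:3d}   ||d h_T / d h_0|| = {jacobian_product_norm(T):.3e}")
```

With these (assumed) contractive weights the printed norms fall off roughly geometrically in T, which is the practical face of the trade-off the abstract describes: dynamics that reliably latch onto information attenuate the gradient signal that would be needed to learn dependencies spanning long intervals.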
