When Recurrent Models Don't Need To Be Recurrent

We prove that stable recurrent neural networks are well approximated by feed-forward networks, both for inference and for training by gradient descent. Our result applies to a broad class of non-linear recurrent neural networks under a natural stability condition, which we observe is also necessary. Complementing our theoretical findings, we verify the conclusions of our theory on both real and synthetic tasks. Furthermore, we demonstrate that recurrent models satisfying the stability assumption of our theory achieve excellent performance on real sequence learning tasks.
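To make the stability condition and the feed-forward approximation concrete, here is a minimal sketch (not the paper's code) for a vanilla tanh RNN: stability is enforced by bounding the spectral norm of the recurrent weight matrix below one, and the feed-forward surrogate is simply the recurrence truncated to the last k inputs. All dimensions, constants, and variable names are illustrative assumptions.

```python
# Minimal sketch (assumption, not the authors' implementation): a stable
# vanilla RNN h_t = tanh(W h_{t-1} + U x_t) and its truncated, feed-forward
# approximation that only sees the last k inputs. Because ||W||_2 < 1 and
# tanh is 1-Lipschitz, the state map is a contraction in h, so the gap
# between the two states decays geometrically in k.
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T, k = 32, 8, 200, 25   # hidden dim, input dim, sequence length, truncation window

# Rescale a random recurrent matrix so its spectral norm is 0.9 (stability condition).
W = rng.standard_normal((d_h, d_h))
W *= 0.9 / np.linalg.norm(W, 2)
U = rng.standard_normal((d_h, d_x)) / np.sqrt(d_x)

def rnn_state(xs, W, U):
    """Run the recurrence from a zero initial state and return the final hidden state."""
    h = np.zeros(W.shape[0])
    for x in xs:
        h = np.tanh(W @ h + U @ x)
    return h

xs = rng.standard_normal((T, d_x))

h_full = rnn_state(xs, W, U)        # full recurrent computation over all T inputs
h_trunc = rnn_state(xs[-k:], W, U)  # feed-forward surrogate: only the last k inputs
print("approximation error:", np.linalg.norm(h_full - h_trunc))
```

Increasing k shrinks the reported error at a geometric rate governed by the spectral norm bound; without the rescaling step (an unstable W), no fixed truncation length gives a uniform guarantee, which is the sense in which the stability condition is also necessary.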
