Capacity and Trainability in Recurrent Neural Networks

Two potential bottlenecks on the expressiveness of recurrent neural networks (RNNs) are their ability to store information about the task in their parameters, and to store information about the input history in their units. We show experimentally that all common RNN architectures achieve nearly the same per-task and per-unit capacity bounds with careful training, for a variety of tasks and stacking depths. They can store an amount of task information that is linear in the number of parameters, approximately 5 bits per parameter. They can additionally store approximately one real number from their input history per hidden unit. We further find that, for several tasks, it is the per-task parameter capacity bound that determines performance. These results suggest that many previous comparisons of RNN architectures are driven primarily by differences in training effectiveness rather than differences in capacity. Supporting this observation, we compare training difficulty across several architectures and show that vanilla RNNs are far more difficult to train, yet have slightly higher capacity. Finally, we propose two novel RNN architectures, one of which is easier to train than the LSTM or GRU for deeply stacked architectures.
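To make these figures concrete, the sketch below converts the size of a single recurrent layer into the implied task capacity. It is not taken from the paper: the per-layer parameter counts for vanilla RNN, GRU, and LSTM cells are the standard ones (1, 3, and 4 weight/bias blocks of size N^2 + N*I + N for N hidden units and I inputs), and the 5-bits-per-parameter and one-real-number-per-unit figures are simply the approximate values stated in the abstract.

```python
# Back-of-envelope capacity estimates implied by the abstract's findings.
# This is an illustrative sketch, not the paper's measurement code.

def n_params(cell: str, n_hidden: int, n_input: int) -> int:
    """Parameter count for one recurrent layer (recurrent + input weights + biases)."""
    per_block = n_hidden * n_hidden + n_hidden * n_input + n_hidden
    blocks = {"rnn": 1, "gru": 3, "lstm": 4}[cell]  # standard gate counts
    return blocks * per_block

BITS_PER_PARAM = 5       # approximate per-parameter task capacity reported above
REALS_PER_UNIT = 1       # approximate per-unit input-history memory reported above

if __name__ == "__main__":
    n_hidden, n_input = 256, 64
    for cell in ("rnn", "gru", "lstm"):
        p = n_params(cell, n_hidden, n_input)
        print(f"{cell:>4}: {p:>9,d} params, "
              f"~{BITS_PER_PARAM * p:,d} bits of task capacity, "
              f"~{REALS_PER_UNIT * n_hidden} reals of input-history memory")
```

Under this accounting, moving from a vanilla RNN to an LSTM of the same hidden size roughly quadruples the parameter count, and hence the estimated task capacity, while the per-unit memory estimate depends only on the number of hidden units.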
