Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets

Despite the recent achievements in machine learning, we are still very far from achieving real artificial intelligence. In this paper, we discuss the limitations of standard deep learning approaches and show that some of these limitations can be overcome by learning how to grow the complexity of a model in a structured way. Specifically, we study the simplest sequence prediction problems that lie beyond the scope of what is learnable with standard recurrent networks: algorithmically generated sequences that can only be learned by models with the capacity to count and to memorize sequences. We show that some basic algorithms can be learned from sequential data using a recurrent network augmented with a trainable memory.
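To make the idea of a recurrent network coupled to a trainable memory concrete, the following is a minimal NumPy sketch of one forward step of an RNN attached to a single continuous (differentiable) stack. The dimensions, weight initialisation, and the simplification of feeding only the top stack element back into the hidden state are illustrative assumptions, not the paper's exact configuration.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative dimensions (assumed, not from the paper).
d_in, d_h, stack_depth = 4, 8, 10
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d_h, d_in))   # input -> hidden
R = rng.normal(scale=0.1, size=(d_h, d_h))    # hidden -> hidden (recurrent)
P = rng.normal(scale=0.1, size=(d_h, 1))      # stack top -> hidden
A = rng.normal(scale=0.1, size=(3, d_h))      # hidden -> action logits (PUSH, POP, NO-OP)
D = rng.normal(scale=0.1, size=(1, d_h))      # hidden -> value pushed onto the stack

def step(x, h_prev, s_prev):
    # Hidden state reads the input, the previous hidden state and the stack top.
    h = sigmoid(U @ x + R @ h_prev + P @ s_prev[:1])
    # Soft (differentiable) choice between PUSH, POP and NO-OP.
    push, pop, noop = softmax(A @ h)
    s = np.empty_like(s_prev)
    # New top: pushed value, or the element below it (pop), or unchanged (no-op).
    s[0] = push * sigmoid(D @ h)[0] + pop * s_prev[1] + noop * s_prev[0]
    # Interior elements shift down on push, up on pop, stay on no-op.
    s[1:-1] = push * s_prev[:-2] + pop * s_prev[2:] + noop * s_prev[1:-1]
    # Bottom element: nothing rises into it on a pop.
    s[-1] = push * s_prev[-2] + noop * s_prev[-1]
    return h, s

h = np.zeros(d_h)
s = np.zeros(stack_depth)  # empty continuous stack
x = np.eye(d_in)[2]        # one-hot input symbol
h, s = step(x, h, s)

Because the push/pop/no-op decision is a soft mixture rather than a discrete choice, every operation is differentiable and the whole model can be trained end to end with ordinary gradient descent, which is what makes the memory "trainable" in the sense used above.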
