Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies

The success of long short-term memory (LSTM) neural networks in language processing is typically attributed to their ability to capture long-distance statistical regularities. Linguistic regularities are often sensitive to syntactic structure; can such dependencies be captured by LSTMs, which do not have explicit structural representations? We begin addressing this question using number agreement in English subject-verb dependencies. We probe the architecture’s grammatical competence both using training objectives with an explicit grammatical target (number prediction, grammaticality judgments) and using language models. In the strongly supervised settings, the LSTM achieved very high overall accuracy (less than 1% errors), but errors increased when sequential and structural information conflicted. The frequency of such errors rose sharply in the language-modeling setting. We conclude that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.
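To make the number-prediction objective concrete, the sketch below shows one way such a probe could be set up: an LSTM reads the words of a sentence up to (but not including) the verb and classifies the upcoming verb as singular or plural. This is a minimal illustration only; the layer sizes, vocabulary handling, and PyTorch-style implementation are assumptions for exposition, not the authors' actual configuration.

```python
# Minimal sketch of a number-prediction probe (illustrative, not the paper's code):
# an LSTM reads the sentence prefix and predicts the number of the upcoming verb.
import torch
import torch.nn as nn

class NumberPredictor(nn.Module):
    def __init__(self, vocab_size, embed_dim=50, hidden_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)  # two classes: singular vs. plural

    def forward(self, prefix_ids):
        # prefix_ids: (batch, seq_len) word indices of the sentence up to,
        # but not including, the verb whose number is being predicted.
        embedded = self.embed(prefix_ids)
        _, (h_n, _) = self.lstm(embedded)
        return self.out(h_n[-1])  # logits over {singular, plural}

# Toy usage on an attraction-style prefix: "the keys to the cabinet ___"
# (the correct verb form is plural, agreeing with "keys", not "cabinet").
vocab = {w: i for i, w in enumerate(["the", "keys", "to", "cabinet"])}
model = NumberPredictor(vocab_size=len(vocab))
prefix = torch.tensor([[vocab["the"], vocab["keys"], vocab["to"],
                        vocab["the"], vocab["cabinet"]]])
logits = model(prefix)  # trained with cross-entropy against the gold number
```

Cases like the one above, where an intervening noun ("cabinet") differs in number from the subject ("keys"), are exactly those in which sequential and structural information conflict and where the reported error rates rise.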
