Tabula Nearly Rasa: Probing the Linguistic Knowledge of Character-level Neural Language Models Trained on Unsegmented Text

Recurrent neural networks (RNNs) have reached striking performance in many natural language processing tasks. This has renewed interest in whether these generic sequence processing devices are inducing genuine linguistic knowledge. Nearly all current analytical studies, however, initialize the RNNs with a vocabulary of known words, and feed them tokenized input during training. We present a multi-lingual study of the linguistic knowledge encoded in RNNs trained as character-level language models, on input data with word boundaries removed. These networks face a tougher and more cognitively realistic task, having to discover any useful linguistic unit from scratch based on input statistics. The results show that our “near tabula rasa” RNNs are mostly able to solve morphological, syntactic and semantic tasks that intuitively presuppose word-level knowledge, and indeed they learned, to some extent, to track word boundaries. Our study opens the door to speculations about the necessity of an explicit, rigid word lexicon in language learning and usage.
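To make the setup concrete, here is a minimal sketch, in PyTorch, of the kind of model the abstract describes: a character-level LSTM language model trained with next-character prediction on text from which word boundaries (whitespace) have been removed. This is not the authors' code; the toy corpus string, the CharLM class, the hyperparameters (embedding size, hidden size, learning rate, number of steps) and the single-sequence training loop are all illustrative assumptions.

    # Minimal illustrative sketch, not the paper's implementation.
    import torch
    import torch.nn as nn

    text = "the cat sat on the mat and the dog sat on the log"
    text = text.replace(" ", "")  # remove word boundaries: "near tabula rasa" input

    chars = sorted(set(text))
    stoi = {c: i for i, c in enumerate(chars)}
    data = torch.tensor([stoi[c] for c in text])  # character indices

    class CharLM(nn.Module):
        def __init__(self, vocab_size, emb=32, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb)
            self.lstm = nn.LSTM(emb, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, x, state=None):
            h, state = self.lstm(self.embed(x), state)
            return self.out(h), state  # logits over next character

    model = CharLM(len(chars))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Next-character prediction: input is data[:-1], target is data[1:].
    x = data[:-1].unsqueeze(0)
    y = data[1:].unsqueeze(0)
    for step in range(200):
        logits, _ = model(x)
        loss = loss_fn(logits.reshape(-1, len(chars)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

The point of the sketch is only that the model never receives word boundaries or a word vocabulary: any word-like units it exploits must be induced from character statistics alone, which is what the probing tasks in the paper then test.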
