Can neural networks acquire a structural bias from raw linguistic data?

We evaluate whether BERT, a widely used neural network for sentence processing, acquires an inductive bias towards forming structural generalizations through pretraining on raw data. We conduct four experiments testing its preference for structural vs. linear generalizations in different structure-dependent phenomena. We find that BERT makes a structural generalization in 3 out of 4 empirical domains (subject-auxiliary inversion, reflexive binding, and verb tense detection in embedded clauses) but makes a linear generalization when tested on NPI licensing. We argue that these results are the strongest evidence so far from artificial learners supporting the proposition that a structural bias can be acquired from raw data. If this conclusion is correct, it is tentative evidence that some linguistic universals can be acquired by learners without innate biases. However, the precise implications for human language acquisition are unclear, as humans learn language from significantly less data than BERT.
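
The experiments summarized above hinge on a train/test split in which the training data are consistent with both a linear and a structural rule, while the test data pull the two apart. As a rough, hypothetical illustration of that logic (not the paper's actual code, stimuli, or training configuration), the sketch below fine-tunes a BERT acceptability classifier with the HuggingFace Transformers library on ambiguous subject-auxiliary inversion items and then checks whether its predictions on disambiguating items follow the structural or the linear rule.

```python
# Hypothetical sketch (not the paper's code): an acceptability classifier on top
# of BERT is fine-tuned on questions where a linear rule ("the fronted auxiliary
# is the linearly first auxiliary") and a structural rule ("the fronted auxiliary
# is the main-clause auxiliary") assign the same label, then evaluated on
# questions where the two rules disagree. Sentences, labels, and hyperparameters
# are illustrative placeholders.
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Ambiguous training items: one auxiliary only, so both rules agree on the label
# (1 = well-formed yes/no question, 0 = ill-formed).
train_items = [
    ("has the bird eaten", 1),
    ("eaten the bird has", 0),
    ("can the girl sing", 1),
    ("sing the girl can", 0),
]

# Disambiguating test items: a second auxiliary sits inside a relative clause on
# the subject. Labels follow the structural rule (which matches English); a
# learner of the linear rule would flip both predictions.
test_items = [
    ("has the bird that can sing eaten", 1),   # main-clause auxiliary fronted
    ("can the bird that sing has eaten", 0),   # linearly first auxiliary fronted
]

def encode(items):
    texts, labels = zip(*items)
    enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    return enc, torch.tensor(labels)

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy ambiguous data
    enc, labels = encode(train_items)
    loss = model(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Agreement with the structural labels on the disambiguating set indicates a
# hierarchical rather than linear generalization.
model.eval()
with torch.no_grad():
    enc, labels = encode(test_items)
    preds = model(**enc).logits.argmax(dim=-1)
print("structural-consistent accuracy:", (preds == labels).float().mean().item())
```

In a real experiment one would generate many templatic items per phenomenon and average over multiple runs; this sketch only fixes the ambiguous-training / disambiguating-test logic implied by the abstract.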
