Do Syntactic Probes Probe Syntax? Experiments with Jabberwocky Probing

Analysing whether neural language models encode linguistic information has become popular in NLP. One method of doing so, frequently cited to support the claim that models like BERT encode syntax, is probing: probes are small supervised models trained to extract linguistic information from another model’s output. If a probe is able to predict a particular structure, it is argued that the model whose output it is trained on must have implicitly learnt to encode it. However, drawing a generalisation about a model’s knowledge of a specific linguistic phenomenon from what a probe is able to learn may be problematic: in this work, we show that semantic cues in the training data mean that syntactic probes do not properly isolate syntax. We generate a new corpus of semantically nonsensical but syntactically well-formed Jabberwocky sentences, which we use to evaluate two probes trained on normal data. We train the probes on several popular language models (BERT, GPT-2, and RoBERTa), and find that in all settings they perform worse when evaluated on these data, for one probe by an average of 15.4 UUAS points absolute. Although in most cases they still outperform the baselines, their lead is reduced substantially, e.g. by 53% in the case of BERT for one probe. This raises the question: what empirical scores constitute knowing syntax?
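One widely used probe of the kind described above is the structural probe of Hewitt and Manning (2019), which learns a single linear map so that squared distances between projected word representations approximate pairwise distances in the dependency tree; UUAS is then obtained by decoding a tree from the predicted distances and comparing its undirected, unlabelled edges to the gold tree. The snippet below is a minimal sketch of that idea, assuming contextual embeddings and gold tree distances have already been extracted; the tensors, dimensions, and loss here are illustrative stand-ins, not the paper’s actual training setup.

```python
# Minimal sketch of a distance-based structural probe (after Hewitt & Manning, 2019).
# The embeddings and gold distances are random stand-ins for one sentence; in practice
# they would come from a model such as BERT and from a parsed treebank.
import torch

torch.manual_seed(0)

n_words, dim, rank = 10, 768, 64
embeddings = torch.randn(n_words, dim)                      # stand-in contextual hidden states
gold_dist = torch.randint(1, 6, (n_words, n_words)).float() # stand-in tree distances
gold_dist = (gold_dist + gold_dist.T) / 2                   # make the toy matrix symmetric
gold_dist.fill_diagonal_(0)

B = torch.nn.Parameter(torch.randn(dim, rank) * 0.01)       # the probe: a single linear map
optimiser = torch.optim.Adam([B], lr=1e-3)

for step in range(100):
    proj = embeddings @ B                                   # (n_words, rank) projections
    diffs = proj.unsqueeze(1) - proj.unsqueeze(0)           # pairwise differences
    pred_dist = (diffs ** 2).sum(-1)                        # predicted squared L2 distances
    loss = (pred_dist - gold_dist).abs().mean()             # L1 loss against gold tree distances
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```

To score such a probe in UUAS terms, one would decode a minimum spanning tree over the predicted distance matrix (e.g. with Prim’s algorithm) and report the fraction of gold dependency edges it recovers, ignoring edge direction and labels.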
