On the Systematicity of Probing Contextualized Word Representations: The Case of Hypernymy in BERT

Contextualized word representations have become a driving force in NLP, motivating widespread interest in understanding their capabilities and the mechanisms by which they operate. Particularly intriguing is their ability to identify and encode conceptual abstractions. Past work has probed BERT representations for this competence, finding that BERT can correctly retrieve noun hypernyms in cloze tasks. In this work, we ask whether probing studies shed light on systematic knowledge in BERT representations. As a case study, we examine the hypernymy knowledge encoded in BERT representations. In particular, we demonstrate through a simple consistency probe that the ability to correctly retrieve hypernyms in cloze tasks, as used in prior work, does not correspond to systematic knowledge in BERT. Our main conclusion is cautionary: even if BERT demonstrates high probing accuracy for a particular competence, it does not necessarily follow that BERT ‘understands’ a concept, and BERT cannot be expected to generalize systematically across applicable contexts.
