Measuring and Improving Consistency in Pretrained Language Models

Abstract

Consistency of a model, that is, the invariance of its behavior under meaning-preserving alternations in its input, is a highly desirable property in natural language processing. In this paper we study the question: are pretrained language models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel🤘, a high-quality resource of English cloze-style query paraphrases, containing a total of 328 paraphrases for 38 relations. Using ParaRel🤘, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.
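As a rough illustration of the kind of measurement the abstract describes, the sketch below queries a masked language model with several paraphrases of a single factual cloze query and reports the fraction of paraphrase pairs whose top-1 predictions agree. This is a minimal sketch, not the authors' implementation: the model choice, the example prompts, and the pairwise-agreement score are illustrative assumptions, and the prompts are not drawn from ParaRel.

```python
# Minimal sketch: pairwise consistency of a masked LM across paraphrases.
# Assumptions (not from the paper): bert-base-cased, hand-written prompts,
# agreement measured as the fraction of paraphrase pairs with equal top-1 tokens.
from itertools import combinations

from transformers import pipeline

# Illustrative paraphrases of one cloze-style factual query.
paraphrases = [
    "[MASK] is the capital of France.",
    "The capital of France is [MASK].",
    "France's capital city is [MASK].",
]

fill_mask = pipeline("fill-mask", model="bert-base-cased")

# Top-1 predicted token for each paraphrase.
preds = [fill_mask(p)[0]["token_str"].strip() for p in paraphrases]

# Consistency: fraction of paraphrase pairs whose predictions agree.
pairs = list(combinations(preds, 2))
consistency = sum(a == b for a, b in pairs) / len(pairs)
print(f"predictions: {preds}, pairwise consistency: {consistency:.2f}")
```

A consistent model would produce the same prediction for every paraphrase (score 1.0); the paper's finding is that PLMs often fall well short of this across relations.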
