Measuring and Improving Consistency in Pretrained Language Models

Abstract

Consistency of a model, that is, the invariance of its behavior under meaning-preserving alternations in its input, is a highly desirable property in natural language processing. In this paper we study the question: are pretrained language models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel🤘, a high-quality resource of English cloze-style query paraphrases, containing a total of 328 paraphrases for 38 relations. Using ParaRel🤘, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.
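As a rough illustration of the kind of measurement the abstract describes, the sketch below queries a masked language model with several paraphrases of a single factual cloze query and reports the fraction of paraphrase pairs whose top-1 predictions agree. This is a minimal sketch, not the authors' implementation: the model choice, the example prompts, and the pairwise-agreement score are illustrative assumptions, and the prompts are not drawn from ParaRel.

```python
# Minimal sketch: pairwise consistency of a masked LM across paraphrases.
# Assumptions (not from the paper): bert-base-cased, hand-written prompts,
# agreement measured as the fraction of paraphrase pairs with equal top-1 tokens.
from itertools import combinations

from transformers import pipeline

# Illustrative paraphrases of one cloze-style factual query.
paraphrases = [
    "[MASK] is the capital of France.",
    "The capital of France is [MASK].",
    "France's capital city is [MASK].",
]

fill_mask = pipeline("fill-mask", model="bert-base-cased")

# Top-1 predicted token for each paraphrase.
preds = [fill_mask(p)[0]["token_str"].strip() for p in paraphrases]

# Consistency: fraction of paraphrase pairs whose predictions agree.
pairs = list(combinations(preds, 2))
consistency = sum(a == b for a, b in pairs) / len(pairs)
print(f"predictions: {preds}, pairwise consistency: {consistency:.2f}")
```

A consistent model would produce the same prediction for every paraphrase (score 1.0); the paper's finding is that PLMs often fall well short of this across relations.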
