How Can We Know What Language Models Know?

Recent work has presented intriguing results examining the knowledge contained in language models (LMs) by having the LM fill in the blanks of prompts such as “Obama is a __ by profession”. These prompts are usually manually created and quite possibly sub-optimal; another prompt such as “Obama worked as a __” may predict the correct profession more accurately. As a result, an inappropriate prompt may fail to retrieve facts that the LM does know, so any given prompt provides only a lower-bound estimate of the knowledge contained in an LM. In this paper, we attempt to estimate the knowledge contained in LMs more accurately by automatically discovering better prompts to use in this querying process. Specifically, we propose mining-based and paraphrasing-based methods to automatically generate high-quality and diverse prompts, as well as ensemble methods to combine answers from different prompts. Extensive experiments on the LAMA benchmark for extracting relational knowledge from LMs demonstrate that our methods can improve accuracy from 31.1% to 39.6%, providing a tighter lower bound on what LMs know. We have released the code and the resulting LM Prompt And Query Archive (LPAQA) at https://github.com/jzbjyb/LPAQA.
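To make the querying process concrete, below is a minimal sketch (not the released LPAQA code) of probing a masked LM with several paraphrased prompts for one relation and ensembling their answer distributions. It assumes the HuggingFace transformers library and BERT; the prompt strings, the `[X]` subject placeholder, and the equal-weight averaging are illustrative stand-ins for the mined/paraphrased prompts and weighted ensembles described in the paper.

```python
# Minimal sketch: query a masked LM with multiple prompts for the same relation
# and average the predicted distributions over the [MASK] slot.
# Assumes `transformers` and `torch` are installed; prompts here are illustrative.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

# Several paraphrases of the "profession" relation; [X] marks the subject slot.
prompts = [
    "[X] is a [MASK] by profession.",
    "[X] worked as a [MASK].",
    "[X]'s profession is [MASK].",
]

def answer_distribution(prompt: str, subject: str) -> torch.Tensor:
    """Return the LM's probability distribution over the [MASK] position."""
    text = prompt.replace("[X]", subject)
    inputs = tokenizer(text, return_tensors="pt")
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index]
    return torch.softmax(logits, dim=-1)

# Equal-weight ensemble: average the distributions from all prompts,
# then read off the top predicted filler tokens.
subject = "Barack Obama"
avg = torch.stack([answer_distribution(p, subject) for p in prompts]).mean(dim=0)
print(tokenizer.convert_ids_to_tokens(avg.topk(5).indices.tolist()))
```

The equal-weight average corresponds to the simplest combination strategy; the paper also explores rank-based and learned weightings over prompts, which would replace the plain `mean` above.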
