The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources in Natural Language Understanding Systems

Many state-of-the-art natural language understanding (NLU) models are based on pretrained neural language models. These models often make inferences using information from multiple sources. An important class of such inferences is those that require both background knowledge, presumably contained in a model's pretrained parameters, and instance-specific information supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution subtasks that require reasoning over multiple facts. These subtasks differ in terms of which knowledge sources contain the relevant facts. Using fictional knowledge, we also introduce subtasks in which the relevant facts are available only at inference time. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at pretraining time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources. Still, even the best-performing models seem to have difficulty reliably integrating knowledge presented only at inference time.
