SituatedQA: Incorporating Extra-Linguistic Contexts into QA

Answers to the same question may change depending on the extra-linguistic contexts (when and where the question was asked). To study this challenge, we introduce SituatedQA, an open-retrieval QA dataset where systems must produce the correct answer to a question given the temporal or geographical context. To construct SituatedQA, we first identify such questions in existing QA datasets. We find that a significant proportion of information-seeking questions have context-dependent answers (e.g., roughly 16.5% of NQ-Open). For such context-dependent questions, we then crowdsource alternative contexts and their corresponding answers. Our study shows that existing models struggle with producing answers that are frequently updated or from uncommon locations. We further quantify how existing models, which are trained on data collected in the past, fail to generalize to answering questions asked in the present, even when provided with an updated evidence corpus (a roughly 15-point drop in accuracy). Our analysis suggests that open-retrieval QA benchmarks should incorporate extra-linguistic context to stay relevant globally and in the future. Our data, code, and datasheet are available at https://situatedqa.github.io/.
