Challenges in Generalization in Open Domain Question Answering

Recent work on Open Domain Question Answering has shown a large discrepancy in model performance between novel test questions and those that largely overlap with training questions. However, it remains unclear which aspects of novel questions make them challenging. Drawing upon studies on systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training set overlap, compositional generalization (comp-gen), and novel-entity generalization (novel-entity). When evaluating six popular parametric and non-parametric models on the established Natural Questions and TriviaQA datasets, we find that even the strongest model's performance on comp-gen/novel-entity questions is 13.1/5.4% and 9.6/1.5% lower, respectively, than on the full test set, indicating the challenge posed by these types of questions. Furthermore, we show that whilst non-parametric models can handle questions containing novel entities, they struggle with those requiring compositional generalization. Through thorough analysis, we find that the key factors behind question difficulty are cascading errors from the retrieval component, the frequency of the question pattern, and the frequency of the entity.
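
To illustrate the first category, the following is a minimal sketch of how test questions might be partitioned into those that overlap with the training set and those that are novel. It assumes a simple normalized exact-match criterion, which is an illustrative choice rather than the annotation procedure used in this work.

```python
# Minimal sketch only: partitions test questions by whether a normalized
# form of the question also appears in the training set. The normalization
# and exact-match criterion are assumptions for illustration, not the
# annotation procedure used in the paper.
import re
import string


def normalize(question: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    question = question.lower()
    question = question.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", question).strip()


def split_by_overlap(train_questions, test_questions):
    """Return (overlapping, novel) test questions w.r.t. the training set."""
    seen = {normalize(q) for q in train_questions}
    overlapping = [q for q in test_questions if normalize(q) in seen]
    novel = [q for q in test_questions if normalize(q) not in seen]
    return overlapping, novel


if __name__ == "__main__":
    train = ["Who wrote Hamlet?", "What is the capital of France?"]
    test = ["who wrote hamlet", "Who directed Jaws?"]
    overlapping, novel = split_by_overlap(train, test)
    print(overlapping)  # ['who wrote hamlet']
    print(novel)        # ['Who directed Jaws?']
```

A fuzzier criterion (e.g., answer overlap or paraphrase matching) could be substituted for exact match; the comp-gen and novel-entity categories would additionally require annotation of question structure and entity mentions.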
