MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models

Retrieval question answering (ReQA) is the task of retrieving a sentence-level answer to a question from an open corpus (Ahmad et al., 2019). This paper presents MultiReQA, a new multi-domain ReQA evaluation suite composed of eight retrieval QA tasks drawn from publicly available QA datasets. We provide the first systematic retrieval-based evaluation over these datasets using two supervised neural models, based on fine-tuning BERT and USE-QA models respectively, as well as a surprisingly strong information retrieval baseline, BM25. Five of these tasks contain both training and test data, while three contain test data only. Performance on the five tasks with training data shows that while a general model covering all domains is achievable, the best performance is often obtained by training exclusively on in-domain data.
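As a concrete illustration of the ReQA setup, the sketch below indexes a small pool of candidate answer sentences and scores them against a question with a BM25 baseline. This is a minimal sketch only, assuming the third-party rank_bm25 package and whitespace tokenization; the paper's own BM25 configuration and candidate construction may differ.

```python
# Minimal sketch of sentence-level answer retrieval with a BM25 baseline.
# Assumes the third-party rank_bm25 package; the paper's actual BM25 setup
# (tokenization, candidate pool construction) may differ.
from rank_bm25 import BM25Okapi

# Candidate pool: each entry is one answer sentence from the corpus.
answer_sentences = [
    "BM25 is a probabilistic relevance ranking function.",
    "BERT is a deep bidirectional transformer pre-trained on large corpora.",
    "Dual encoders embed questions and answers into a shared vector space.",
]

# Whitespace tokenization keeps the sketch self-contained; a real setup
# would use a proper tokenizer.
tokenized_corpus = [s.lower().split() for s in answer_sentences]
bm25 = BM25Okapi(tokenized_corpus)

question = "What kind of ranking function is BM25?"
# Retrieve the two highest-scoring candidate answer sentences.
top_answers = bm25.get_top_n(question.lower().split(), answer_sentences, n=2)
print(top_answers)
```

In the same spirit, the neural models replace the BM25 scorer with a dual encoder that embeds questions and candidate answers and ranks candidates by dot product.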

[1] Ali Farhadi, et al. Phrase-Indexed Question Answering: A New Challenge for Scalable Document Comprehension, 2018, EMNLP.

[2] Petr Sojka, et al. Software Framework for Topic Modelling with Large Corpora, 2010.

[3] Ming-Wei Chang, et al. Latent Retrieval for Weakly Supervised Open Domain Question Answering, 2019, ACL.

[4] Yoshua Bengio, et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, 2018, EMNLP.

[5] Ellen M. Voorhees. The TREC question answering track, 2001.

[6] Jonghyun Choi, et al. Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Wolfgang Nejdl, et al. Using ODP metadata to personalize search, 2005, SIGIR '05.

[8] Ye Li, et al. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval, 2020, ArXiv.

[9] Ray Kurzweil, et al. Multilingual Universal Sentence Encoder for Semantic Retrieval, 2019, ACL.

[10] Wiebke Wagner, et al. Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit, 2010, Lang. Resour. Evaluation.

[11] Georgios Balikas, et al. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition, 2015, BMC Bioinformatics.

[12] Keith Stevens, et al. Effective Parallel Corpus Mining using Bilingual Sentence Embeddings, 2018, WMT.

[13] Hugo Zaragoza, et al. The Probabilistic Relevance Framework: BM25 and Beyond, 2009, Found. Trends Inf. Retr.

[14] Wei-Cheng Chang, et al. Pre-training Tasks for Embedding-based Large-scale Retrieval, 2020, ICLR.

[15] David Hawking, et al. Challenges in Enterprise Search, 2004, ADC.

[16] Petr Baudis, et al. Modeling of the Question Answering Task in the YodaQA System, 2015, CLEF.

[17] Yi Yang, et al. WikiQA: A Challenge Dataset for Open-Domain Question Answering, 2015, EMNLP.

[18] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[19] Eunsol Choi, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, 2017, ACL.

[20] Noah Constant, et al. LAReQA: Language-agnostic Answer Retrieval from a Multilingual Pool, 2020, EMNLP.

[21] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[22] Jason Weston, et al. Reading Wikipedia to Answer Open-Domain Questions, 2017, ACL.

[23] Daniel Gillick, et al. End-to-End Retrieval in Continuous Space, 2018, ArXiv.

[24] Ji Ma, et al. Zero-shot Neural Retrieval via Domain-targeted Synthetic Query Generation, 2020, ArXiv.

[25] Jacob Eisenstein, et al. Sparse, Dense, and Attentional Representations for Text Retrieval, 2021, Transactions of the Association for Computational Linguistics.

[26] Mihai Surdeanu, et al. Learning to Rank Answers on Large Online QA Collections, 2008, ACL.

[27] Kyunghyun Cho, et al. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine, 2017, ArXiv.

[28] Danqi Chen, et al. Dense Passage Retrieval for Open-Domain Question Answering, 2020, EMNLP.

[29] Eunsol Choi, et al. MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension, 2019, MRQA@EMNLP.

[30] Naveen Arivazhagan, et al. Language-agnostic BERT Sentence Embedding, 2020, ArXiv.

[31] Harris Wu, et al. Evaluating Web-based Question Answering Systems, 2002, LREC.

[32] Ellen M. Voorhees, et al. Building a question answering test collection, 2000, SIGIR '00.

[33] Ali Farhadi, et al. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index, 2019, ACL.

[34] Ming-Wei Chang, et al. Natural Questions: A Benchmark for Question Answering Research, 2019, TACL.

[35] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.

[36] Hal Daumé, et al. Deep Unordered Composition Rivals Syntactic Methods for Text Classification, 2015, ACL.

[37] Daniel Petrie, et al. Are You Smarter Than a Fifth Grader, 2010.

[38] George Kurian, et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, 2016, ArXiv.

[39] Noah Constant, et al. ReQA: An Evaluation for End-to-End Answer Retrieval Models, 2019, EMNLP.