Questions Are All You Need to Train a Dense Passage Retriever

We introduce ART, a new corpus-level autoencoding approach for training dense retrieval models that does not require any labeled training data. Dense retrieval is a central challenge for open-domain tasks, such as Open QA, where state-of-the-art methods typically require large supervised datasets with custom hard-negative mining and denoising of positive examples. ART, in contrast, only requires access to unpaired inputs and outputs (e.g., questions and potential answer documents). It uses a new document-retrieval autoencoding scheme, where (1) an input question is used to retrieve a set of evidence documents, and (2) the documents are then used to compute the probability of reconstructing the original question. Training for retrieval based on question reconstruction enables effective unsupervised learning of both document and question encoders, which can later be incorporated into complete Open QA systems without any further finetuning. Extensive experiments demonstrate that ART obtains state-of-the-art results on multiple QA retrieval benchmarks with only generic initialization from a pretrained language model, removing the need for labeled data and task-specific losses.
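
To make the two-step scheme concrete, below is a minimal PyTorch sketch of the kind of training objective the abstract suggests: the retriever's softmax distribution over the top-K retrieved passages is pulled toward a teacher distribution derived from each passage's question-reconstruction likelihood under a frozen pretrained language model. This is a rough illustration based only on the abstract, not the paper's released implementation; all function and variable names (art_step, recon_log_likelihoods, tau) are hypothetical, and the KL-divergence formulation is an assumption.

    import torch
    import torch.nn.functional as F

    def art_step(question_emb, passage_embs, recon_log_likelihoods, tau=1.0):
        """One hypothetical ART-style update over K retrieved passages.

        question_emb:          [d]    question-encoder output
        passage_embs:          [K, d] passage-encoder outputs for the top-K passages
        recon_log_likelihoods: [K]    log p(question | passage) from a frozen PLM
        tau:                   temperature for the teacher distribution (assumed)
        """
        # Step (1): retriever relevance scores are dot products, normalized
        # into a distribution over the retrieved set.
        retriever_scores = passage_embs @ question_emb        # [K]
        log_q = F.log_softmax(retriever_scores, dim=-1)

        # Step (2): teacher distribution from the frozen language model's
        # question-reconstruction likelihoods, renormalized over the top-K.
        p = F.softmax(recon_log_likelihoods / tau, dim=-1)

        # KL(p || q): gradients flow into both encoders via retriever_scores;
        # the language model itself receives no gradient.
        return F.kl_div(log_q, p, reduction="sum")

    # Usage with stand-in tensors in place of real encoder / PLM outputs:
    K, d = 32, 768
    q = torch.randn(d, requires_grad=True)
    docs = torch.randn(K, d, requires_grad=True)
    recon = torch.randn(K)          # would come from the frozen PLM
    loss = art_step(q, docs, recon)
    loss.backward()

Note that because the teacher is just the frozen language model's reconstruction likelihood, no positive labels or mined hard negatives are needed; the other retrieved passages implicitly act as negatives inside the softmax.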
