RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering

In open-domain question answering, dense passage retrieval has become a new paradigm for retrieving relevant passages to find answers. Typically, a dual-encoder architecture is adopted to learn dense representations of questions and passages for semantic matching. However, it is difficult to train a dual-encoder effectively due to several challenges: the discrepancy between training and inference, the existence of unlabeled positives, and limited training data. To address these challenges, we propose an optimized training approach, called RocketQA, to improving dense passage retrieval. RocketQA makes three major technical contributions: cross-batch negatives, denoised hard negatives, and data augmentation. Experimental results show that RocketQA significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions. We also conduct extensive experiments to examine the effectiveness of the three strategies in RocketQA. Furthermore, we demonstrate that end-to-end QA performance can be improved by building on the RocketQA retriever.
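To make the cross-batch negatives idea concrete, the sketch below contrasts it with standard in-batch negatives for a dual-encoder. It is a minimal illustration in PyTorch, not the authors' implementation; the function names and the distributed setup are assumptions for the example.

```python
# A minimal sketch (assumptions, not the paper's code) of dual-encoder
# contrastive training with in-batch vs. cross-batch negatives.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def in_batch_negatives_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb, p_emb: (B, d) question/passage embeddings from the two encoders.
    The i-th passage is the positive for the i-th question; the other B - 1
    passages in the batch serve as its negatives."""
    scores = q_emb @ p_emb.t()                                  # (B, B) dot-product similarities
    labels = torch.arange(q_emb.size(0), device=q_emb.device)   # positives sit on the diagonal
    return F.cross_entropy(scores, labels)

def cross_batch_negatives_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """Cross-batch negatives: gather passage embeddings from all A GPUs so each
    question is contrasted against A*B - 1 negatives instead of B - 1."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    gathered = [torch.zeros_like(p_emb) for _ in range(world_size)]
    dist.all_gather(gathered, p_emb)   # collect passage embeddings from every GPU
    gathered[rank] = p_emb             # re-insert the local tensor so its gradient survives
    all_p = torch.cat(gathered, dim=0)                          # (A*B, d)
    scores = q_emb @ all_p.t()                                  # (B, A*B)
    labels = torch.arange(q_emb.size(0), device=q_emb.device) + rank * q_emb.size(0)
    return F.cross_entropy(scores, labels)
```

The other two strategies build on this trained retriever: denoised hard negatives are drawn from the retriever's own top-ranked passages after a stronger cross-encoder filters out likely unlabeled positives, and data augmentation uses the same cross-encoder to label new question-passage pairs as additional training data.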
