RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking

In various natural language processing tasks, passage retrieval and passage re-ranking are two key procedures for finding and ranking relevant information. Since both procedures contribute to the final performance, it is important to jointly optimize them to achieve mutual improvement. In this paper, we propose a novel joint training approach for dense passage retrieval and passage re-ranking. A major contribution is the introduction of dynamic listwise distillation, in which we design a unified listwise training approach for both the retriever and the re-ranker. During dynamic distillation, the retriever and the re-ranker are adaptively improved according to each other's relevance information. We also propose a hybrid data augmentation strategy to construct diverse training instances for the listwise training approach. Extensive experiments demonstrate the effectiveness of our approach on both the MS MARCO and Natural Questions datasets. Our code is available at https://github.com/PaddlePaddle/RocketQA.
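To make the core idea concrete, the sketch below shows one way dynamic listwise distillation could be implemented in PyTorch. It is a minimal illustration based on the description above, not the released implementation: the function name, tensor layout, and the `distill_weight` coefficient are assumptions. Both models score the same per-query candidate list, the scores are normalized into relevance distributions, the retriever's distribution is pushed toward the re-ranker's via a KL term, and the re-ranker is supervised with a listwise cross-entropy loss over the list.

```python
# Minimal sketch of dynamic listwise distillation, assuming each query comes
# with a candidate list of fixed size L. Names and the loss weighting are
# illustrative assumptions, not taken from the paper's released code.
import torch
import torch.nn.functional as F

def dynamic_listwise_distillation(retriever_scores: torch.Tensor,
                                  reranker_scores: torch.Tensor,
                                  positive_idx: torch.Tensor,
                                  distill_weight: float = 1.0) -> torch.Tensor:
    """Joint listwise loss over a batch of candidate lists.

    retriever_scores: [B, L] dual-encoder similarities (e.g., dot products).
    reranker_scores:  [B, L] cross-encoder relevance scores for the same lists.
    positive_idx:     [B]    index of the annotated positive in each list.
    """
    # Normalize both score lists into relevance distributions over candidates.
    retriever_logprobs = F.log_softmax(retriever_scores, dim=-1)
    reranker_probs = F.softmax(reranker_scores, dim=-1)

    # Distillation term: align the retriever's distribution with the
    # re-ranker's. Because the re-ranker's scores stay inside the graph,
    # gradients flow into both models, so each adapts to the other's
    # relevance information ("dynamic" rather than one-way distillation).
    kl_loss = F.kl_div(retriever_logprobs, reranker_probs,
                       reduction="batchmean")

    # Supervised listwise term on the re-ranker: cross entropy over the
    # candidate list, treating the ground-truth positive as the target.
    sup_loss = F.cross_entropy(reranker_scores, positive_idx)

    return sup_loss + distill_weight * kl_loss
```

In this reading, the quality of the loss depends on the diversity of each candidate list, which is what the hybrid data augmentation strategy is meant to provide.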
