PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval

Recently, dense passage retrieval has become a mainstream approach to finding relevant information in various natural language processing tasks. A number of studies have been devoted to improving the widely adopted dual-encoder architecture. However, most previous studies only consider the query-centric similarity relation when learning the dual-encoder retriever. In order to capture more comprehensive similarity relations, we propose a novel approach that leverages both query-centric and PAssage-centric sImilarity Relations (called PAIR) for dense passage retrieval. To implement our approach, we make three major technical contributions: introducing formal formulations of the two kinds of similarity relations, generating high-quality pseudo-labeled data via knowledge distillation, and designing an effective two-stage training procedure that incorporates a passage-centric similarity relation constraint. Extensive experiments show that our approach significantly outperforms previous state-of-the-art models on both the MSMARCO and Natural Questions datasets.
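To make the two kinds of similarity relations concrete, a minimal sketch of what such a joint objective could look like is given below. The score function s(.,.), the negative set P^-, the balancing coefficient \alpha, and the loss names L_qc and L_pc are illustrative assumptions, not the paper's exact definitions. The query-centric relation asks a query to score higher with its positive passage than with negatives, s(q, p^+) > s(q, p^-); the passage-centric relation additionally asks the positive passage to score higher with the query than with negative passages, s(q, p^+) > s(p^+, p^-). With a set of negatives P^-, the two constraints could be softened into cross-entropy-style losses and combined:

\mathcal{L}_{qc} = -\log \frac{\exp\big(s(q, p^{+})\big)}{\exp\big(s(q, p^{+})\big) + \sum_{p^{-} \in P^{-}} \exp\big(s(q, p^{-})\big)}

\mathcal{L}_{pc} = -\log \frac{\exp\big(s(q, p^{+})\big)}{\exp\big(s(q, p^{+})\big) + \sum_{p^{-} \in P^{-}} \exp\big(s(p^{+}, p^{-})\big)}

\mathcal{L} = \alpha\, \mathcal{L}_{qc} + (1 - \alpha)\, \mathcal{L}_{pc}

Under this reading, the first training stage would optimize the combined loss \mathcal{L} on the distillation-generated pseudo-labeled data, and the second stage would fine-tune with the query-centric term alone on labeled data; this staging is likewise an assumption inferred from the abstract rather than a verbatim description of the method.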
