Curriculum Sampling for Dense Retrieval with Document Expansion

The dual encoder has become the de facto architecture for dense retrieval: it computes the latent representations of the query and the document independently, and thus fails to fully capture the interactions between them. To alleviate this, recent work seeks query-informed document representations. During training, the document is expanded with a real query; at inference, the real query is replaced with a generated pseudo query. This discrepancy between training and inference causes the dense retrieval model to attend mostly to the query and largely ignore the document when computing the document representation. As a result, it can perform even worse than the vanilla dense retrieval model, because its performance hinges on how relevant the generated queries are to the real query. In this paper, we propose a curriculum sampling strategy that also uses pseudo queries during training and gradually increases their relevance to the real query. In this way, the retrieval model learns to extend its attention from the document alone to both the document and the query, yielding high-quality query-informed document representations. Experimental results on several passage retrieval datasets show that our approach outperforms previous dense retrieval methods.
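The abstract only sketches the curriculum, so the following is a minimal Python sketch of one way such curriculum sampling could work, assuming the generated pseudo queries for a document are ranked by a relevance score against the real (annotated) query and the sampling position moves from the low-relevance end toward the high-relevance end as training proceeds. The names `curriculum_sample_pseudo_query` and `overlap_scorer`, the token-overlap scorer, and the linear schedule are illustrative assumptions, not the paper's actual implementation.

```python
import random


def overlap_scorer(q1: str, q2: str) -> float:
    """Illustrative relevance function (assumption): token-overlap F1 between two queries."""
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    if not t1 or not t2:
        return 0.0
    inter = len(t1 & t2)
    return 2.0 * inter / (len(t1) + len(t2))


def curriculum_sample_pseudo_query(generated_queries, real_query, step, total_steps,
                                   scorer=overlap_scorer, window=2):
    """Pick one generated pseudo query to expand the document with at this training step."""
    # Rank candidates from least to most relevant to the annotated (real) query.
    ranked = sorted(generated_queries, key=lambda q: scorer(q, real_query))

    # Training progress in [0, 1].
    progress = min(step / max(total_steps - 1, 1), 1.0)

    # Curriculum position: start near the low-relevance end of the ranking and
    # move toward the high-relevance end, so the training-time expansion
    # gradually approaches the relevance level of the real query.
    center = int(round(progress * (len(ranked) - 1)))

    # Sample from a small window around the curriculum position for stochasticity.
    lo = max(0, center - window)
    hi = min(len(ranked), center + window + 1)
    return random.choice(ranked[lo:hi])


# Usage sketch: expand the passage text with the sampled pseudo query, e.g.
# expanded_doc = passage + " [SEP] " + curriculum_sample_pseudo_query(
#     pseudo_queries, real_query, step=global_step, total_steps=num_train_steps)
```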
