Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval

In this paper, we systematically study the potential of pre-training with Large Language Model (LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e., query generation, and effectively transfer the expanded knowledge to retrievers using pre-training strategies tailored for passage retrieval. These strategies include contrastive learning and bottlenecked query generation. Furthermore, we incorporate a curriculum learning strategy to reduce the reliance on LLM inference. Experimental results demonstrate that pre-training with LLM-based document expansion significantly boosts retrieval performance on large-scale web-search tasks. Our approach also shows strong zero-shot and out-of-domain retrieval ability, making it widely applicable to retrieval settings where no human-labeled data is available for initialization.
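To make the two-stage recipe concrete, below is a minimal sketch of (1) LLM-based document expansion, i.e., prompting a causal LLM to generate pseudo-queries for a passage, and (2) contrastive pre-training of a dual encoder on the resulting (pseudo-query, passage) pairs with in-batch negatives. The model names, prompt template, temperature, and other hyperparameters are illustrative assumptions rather than the paper's exact configuration; the bottlenecked query-generation objective and the curriculum schedule are omitted for brevity.

```python
# Sketch: LLM-based document expansion + contrastive pre-training of a dense retriever.
# Model names, the prompt, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel

# --- Step 1: document expansion, i.e., generate pseudo-queries with an LLM ---
gen_name = "huggyllama/llama-7b"  # any LLaMA-style causal LM (assumption)
gen_tok = AutoTokenizer.from_pretrained(gen_name)
gen_lm = AutoModelForCausalLM.from_pretrained(gen_name, torch_dtype=torch.float16)

def generate_queries(passage: str, n: int = 3) -> list[str]:
    """Prompt the LLM to write n search queries that the passage answers."""
    prompt = f"Passage: {passage}\nWrite a search query this passage answers.\nQuery:"
    inputs = gen_tok(prompt, return_tensors="pt")
    outputs = gen_lm.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,
        top_p=0.9,
        num_return_sequences=n,
        pad_token_id=gen_tok.eos_token_id,
    )
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return [gen_tok.decode(t, skip_special_tokens=True).strip() for t in new_tokens]

# --- Step 2: contrastive pre-training of a dual encoder on (query, passage) pairs ---
enc_name = "bert-base-uncased"
enc_tok = AutoTokenizer.from_pretrained(enc_name)
encoder = AutoModel.from_pretrained(enc_name)

def embed(texts: list[str]) -> torch.Tensor:
    """Use the [CLS] hidden state as the dense representation."""
    batch = enc_tok(texts, padding=True, truncation=True, max_length=128,
                    return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

def contrastive_loss(queries: list[str], passages: list[str]) -> torch.Tensor:
    """InfoNCE with in-batch negatives: each query matches its own passage."""
    q, p = embed(queries), embed(passages)
    scores = q @ p.T / 0.05  # temperature 0.05 is an assumed value
    labels = torch.arange(len(queries))
    return F.cross_entropy(scores, labels)
```

In the full approach described in the abstract, the generated pseudo-queries would additionally supervise a bottlenecked query-generation decoder, and the proportion of LLM-expanded data would follow a curriculum to limit the number of LLM inferences; both are left out of this sketch.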
