Pre-training with Large Language Model-based Document Expansion for Dense Passage Retrieval

In this paper, we systematically study the potential of pre-training with Large Language Model (LLM)-based document expansion for dense passage retrieval. Concretely, we leverage the capabilities of LLMs for document expansion, i.e., query generation, and effectively transfer the expanded knowledge to retrievers using pre-training strategies tailored for passage retrieval. These strategies include contrastive learning and bottlenecked query generation. Furthermore, we incorporate a curriculum learning strategy to reduce the reliance on LLM inference. Experimental results demonstrate that pre-training with LLM-based document expansion significantly boosts retrieval performance on large-scale web-search tasks. Our approach also shows strong zero-shot and out-of-domain retrieval ability, making it widely applicable to retrieval settings where no human-labeled data is available for initialization.
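To make the two-stage recipe concrete, below is a minimal sketch of (1) LLM-based document expansion, i.e., prompting a causal LLM to generate pseudo-queries for a passage, and (2) contrastive pre-training of a dual encoder on the resulting (pseudo-query, passage) pairs with in-batch negatives. The model names, prompt template, temperature, and other hyperparameters are illustrative assumptions rather than the paper's exact configuration; the bottlenecked query-generation objective and the curriculum schedule are omitted for brevity.

```python
# Sketch: LLM-based document expansion + contrastive pre-training of a dense retriever.
# Model names, the prompt, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModel

# --- Step 1: document expansion, i.e., generate pseudo-queries with an LLM ---
gen_name = "huggyllama/llama-7b"  # any LLaMA-style causal LM (assumption)
gen_tok = AutoTokenizer.from_pretrained(gen_name)
gen_lm = AutoModelForCausalLM.from_pretrained(gen_name, torch_dtype=torch.float16)

def generate_queries(passage: str, n: int = 3) -> list[str]:
    """Prompt the LLM to write n search queries that the passage answers."""
    prompt = f"Passage: {passage}\nWrite a search query this passage answers.\nQuery:"
    inputs = gen_tok(prompt, return_tensors="pt")
    outputs = gen_lm.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,
        top_p=0.9,
        num_return_sequences=n,
        pad_token_id=gen_tok.eos_token_id,
    )
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return [gen_tok.decode(t, skip_special_tokens=True).strip() for t in new_tokens]

# --- Step 2: contrastive pre-training of a dual encoder on (query, passage) pairs ---
enc_name = "bert-base-uncased"
enc_tok = AutoTokenizer.from_pretrained(enc_name)
encoder = AutoModel.from_pretrained(enc_name)

def embed(texts: list[str]) -> torch.Tensor:
    """Use the [CLS] hidden state as the dense representation."""
    batch = enc_tok(texts, padding=True, truncation=True, max_length=128,
                    return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

def contrastive_loss(queries: list[str], passages: list[str]) -> torch.Tensor:
    """InfoNCE with in-batch negatives: each query matches its own passage."""
    q, p = embed(queries), embed(passages)
    scores = q @ p.T / 0.05  # temperature 0.05 is an assumed value
    labels = torch.arange(len(queries))
    return F.cross_entropy(scores, labels)
```

In the full approach described in the abstract, the generated pseudo-queries would additionally supervise a bottlenecked query-generation decoder, and the proportion of LLM-expanded data would follow a curriculum to limit the number of LLM inferences; both are left out of this sketch.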
