Learning to Tokenize for Generative Retrieval

Conventional document retrieval techniques are mainly based on the index-retrieve paradigm, which is difficult to optimize in an end-to-end manner. As an alternative, generative retrieval represents documents as identifiers (docids) and retrieves documents by generating docids, enabling end-to-end modeling of the document retrieval task. However, how document identifiers should be defined remains an open question. Current approaches rely on fixed, rule-based docids, such as the title of a document or the result of clustering BERT embeddings, which often fail to capture the complete semantic information of a document. We propose GenRet, a document tokenization learning method that addresses the challenge of defining document identifiers for generative retrieval. GenRet learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. GenRet consists of three components: (i) a tokenization model that produces docids for documents; (ii) a reconstruction model that learns to reconstruct a document from its docid; and (iii) a sequence-to-sequence retrieval model that generates relevant docids directly for a given query. Through this auto-encoding framework, GenRet learns semantic docids in a fully end-to-end manner. We also develop a progressive training scheme to capture the autoregressive nature of docids and to stabilize training. We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets to assess the effectiveness of GenRet. GenRet establishes a new state of the art on NQ320K. In particular, compared to generative retrieval baselines, GenRet achieves significant improvements on unseen documents. GenRet also outperforms comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.
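To make the discrete auto-encoding idea behind the first two components more concrete, below is a minimal PyTorch sketch of a tokenization model that assigns each document a short sequence of discrete docid tokens via codebook lookup, paired with a reconstruction model trained to recover the document representation from those tokens. All module names (DocTokenizer, Reconstructor), dimensions, loss weights, and the straight-through quantization trick are illustrative assumptions for this sketch, not the paper's actual implementation; the sequence-to-sequence retrieval model is omitted.

```python
# Minimal sketch of discrete auto-encoding document tokenization, loosely
# following the GenRet recipe described in the abstract. All names, shapes,
# and hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DocTokenizer(nn.Module):
    """Maps a document embedding to M discrete docid tokens via codebook lookup."""

    def __init__(self, dim=256, codebook_size=512, docid_len=4):
        super().__init__()
        self.docid_len = docid_len
        # One codebook per docid position (a hypothetical design choice).
        self.codebooks = nn.Parameter(torch.randn(docid_len, codebook_size, dim))
        self.project = nn.Linear(dim, docid_len * dim)

    def forward(self, doc_emb):                      # doc_emb: (B, dim)
        z = self.project(doc_emb).view(-1, self.docid_len, doc_emb.size(-1))
        # Pick the most similar codebook entry per position -> discrete docid tokens.
        sims = torch.einsum("bmd,mkd->bmk", z, self.codebooks)
        docid = sims.argmax(dim=-1)                  # (B, M) discrete token ids
        quantized = torch.stack(
            [self.codebooks[m][docid[:, m]] for m in range(self.docid_len)], dim=1
        )                                            # (B, M, dim)
        # Straight-through estimator so gradients still reach the projection.
        quantized = z + (quantized - z).detach()
        return docid, quantized, z


class Reconstructor(nn.Module):
    """Learns to reconstruct the document embedding from its quantized docid."""

    def __init__(self, dim=256, docid_len=4):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Linear(docid_len * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, quantized):                    # quantized: (B, M, dim)
        return self.decode(quantized.flatten(1))     # (B, dim)


def autoencoding_loss(doc_emb, tokenizer, reconstructor, beta=0.25):
    """Reconstruction + commitment loss, as in standard discrete auto-encoders."""
    _, quantized, z = tokenizer(doc_emb)
    recon = reconstructor(quantized)
    recon_loss = F.mse_loss(recon, doc_emb)
    commit_loss = F.mse_loss(z, quantized.detach())
    return recon_loss + beta * commit_loss
```

In GenRet proper, the tokenization, reconstruction, and retrieval models are sequence models trained jointly, with the progressive scheme handling the autoregressive structure of the docid positions; the sketch above only illustrates the discrete auto-encoding core that ties docids to document semantics.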
