Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval

Pseudo-relevance feedback mechanisms, from Rocchio's algorithm to relevance models, have shown the usefulness of expanding and reweighting users' initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval -- through the use of neural contextual language models such as BERT for analysing the documents' and queries' contents and computing their relevance scores -- has shown promising performance on several information retrieval tasks, while not relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of a single embedded representation for each passage and query (e.g. using BERT's [CLS] token), or multiple representations (e.g. using an embedding for each token of the query and document). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback. In particular, based on the pseudo-relevant set of documents identified by a first-pass dense retrieval, we extract representative feedback embeddings (using KMeans clustering) -- while ensuring that these embeddings discriminate among passages (based on IDF) -- which are then added to the query representation. These additional feedback embeddings are shown to enhance the effectiveness of both reranking and an additional dense retrieval operation. Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by applying our proposed ColBERT-PRF method to a ColBERT dense retrieval approach.
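The expansion step described above can be sketched as follows. This is a simplified, NumPy-only illustration, not the paper's implementation: function and parameter names (`colbert_prf_expand`, `k_clusters`, `f_expand`, `beta`) are illustrative stand-ins for the paper's K, f_e and β hyperparameters, a basic Lloyd's-algorithm KMeans replaces the library clustering used in practice, and each centroid's IDF is looked up via its nearest feedback token.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Basic Lloyd's algorithm: cluster the rows of X into k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each embedding to its nearest centroid (squared Euclidean).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(0)
    return centroids

def colbert_prf_expand(query_embs, feedback_embs, feedback_tok_ids, idf,
                       k_clusters=8, f_expand=3, beta=1.0):
    """Expand a multi-representation query with pseudo-relevance feedback.

    query_embs:       (q, dim) query token embeddings from first-pass retrieval
    feedback_embs:    (n, dim) token embeddings pooled from the pseudo-relevant
                      passages returned by that first pass
    feedback_tok_ids: (n,) token ids aligned with feedback_embs
    idf:              mapping from token id to IDF, used to keep only
                      discriminative feedback embeddings
    """
    # 1. Cluster the feedback token embeddings into representative centroids.
    centroids = kmeans(feedback_embs, k_clusters)
    # 2. Map each centroid to its most similar feedback token so that the
    #    centroid can be scored by that token's IDF.
    sims = centroids @ feedback_embs.T
    nearest_tok = feedback_tok_ids[sims.argmax(1)]
    scores = np.array([idf.get(int(t), 0.0) for t in nearest_tok])
    # 3. Keep the f_expand most discriminative centroids, weight them by beta,
    #    and append them to the original query representation.
    top = scores.argsort()[::-1][:f_expand]
    return np.vstack([query_embs, beta * centroids[top]])
```

In the full method, the resulting expanded query matrix is used both to rerank the first-pass candidates and to issue a second dense retrieval pass; here the sketch only shows how the extra feedback embeddings are derived and attached to the query.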
