On Approximate Nearest Neighbour Selection for Multi-Stage Dense Retrieval

Dense retrieval, which uses contextualised language models such as BERT to identify documents in a collection by leveraging approximate nearest neighbour (ANN) techniques, has been increasing in popularity. Two families of approaches have emerged, depending on whether documents and queries are represented by single or multiple embeddings. ColBERT, the exemplar of the latter, uses an ANN index and approximate scores to identify a set of candidate documents for each query embedding, which are then re-ranked using accurate document representations. In this manner, a large number of documents can be retrieved for each query, hindering the efficiency of the approach. In this work, we investigate the use of ANN scores for ranking the candidate documents, in order to decrease the number of candidate documents being fully scored. Experiments conducted on the MSMARCO passage ranking corpus demonstrate that, by using the approximate scores to cut the candidate set to only 200 documents, we can still obtain an effective ranking, without statistically significant differences in effectiveness, while achieving a 2x speedup.
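
The pipeline described above can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation. It assumes that the ANN index has already returned, for each query embedding, a list of (document id, approximate score) pairs, and that a document's approximate score is the sum over query embeddings of its best retrieved score; this aggregation rule is an assumption made for illustration. The function names (approx_doc_scores, prune_candidates, maxsim, rerank) and the toy data are hypothetical. Exact scoring uses ColBERT-style late interaction (sum of maximum dot products).

```python
from collections import defaultdict


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def approx_doc_scores(ann_hits):
    """Aggregate ANN hits into one approximate score per document.

    ann_hits[i] holds the (doc_id, approx_score) pairs returned by the
    ANN index for query embedding i. Assumed aggregation: for each query
    embedding keep the best score per document, then sum across embeddings.
    """
    totals = defaultdict(float)
    for hits in ann_hits:
        best = {}
        for doc_id, score in hits:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
        for doc_id, score in best.items():
            totals[doc_id] += score
    return dict(totals)


def prune_candidates(ann_hits, k=200):
    """Rank candidates by approximate score and keep only the top k."""
    scores = approx_doc_scores(ann_hits)
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:k]


def maxsim(query_embs, doc_embs):
    """Exact late-interaction score: sum over query embeddings of the
    maximum dot product with any embedding of the document."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


def rerank(query_embs, doc_store, candidates):
    """Fully score only the surviving candidates, best first."""
    scored = [(d, maxsim(query_embs, doc_store[d])) for d in candidates]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Toy data (hypothetical): two query embeddings, three documents.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_store = {
    "d1": [[1.0, 0.0], [0.0, 1.0]],
    "d2": [[0.5, 0.5]],
    "d3": [[0.0, 0.2]],
}
ann_hits = [
    [("d1", 0.9), ("d2", 0.4)],
    [("d1", 0.8), ("d2", 0.5), ("d3", 0.1)],
]

candidates = prune_candidates(ann_hits, k=2)   # d3 is cut before exact scoring
ranking = rerank(query_embs, doc_store, candidates)
print(ranking)
```

With k set to 200, as in the experiments reported above, only 200 documents per query would reach the exact scoring step; the toy example uses k=2 so that the cut is visible.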
