In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval

We present an efficient training approach to text retrieval with dense representations that applies knowledge distillation using the ColBERT late-interaction ranking model. Specifically, we propose to transfer the knowledge from a bi-encoder teacher to a student by distilling knowledge from ColBERT’s expressive MaxSim operator into a simple dot product. The advantage of the bi-encoder teacher–student setup is that we can efficiently add in-batch negatives during knowledge distillation, enabling richer interactions between teacher and student models. In addition, using ColBERT as the teacher reduces training cost compared to a full cross-encoder. Experiments on the MS MARCO passage and document ranking tasks, as well as on data from the TREC 2019 Deep Learning Track, demonstrate that our approach helps models learn robust representations for dense retrieval effectively and efficiently.

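The following is a minimal, hypothetical sketch (not the authors' released code) of the distillation setup described above: a ColBERT-style MaxSim teacher scores every query against every passage in the batch, a single-vector student scores the same pairs with a dot product, and the student is trained to match the teacher's softened score distribution, so every other passage in the batch acts as an in-batch negative. All tensor shapes, function names, and the temperature value are illustrative assumptions.

```python
# Hedged sketch of MaxSim-to-dot-product distillation with in-batch negatives.
import torch
import torch.nn.functional as F

def colbert_maxsim_scores(q_tok, p_tok):
    """Teacher scores (ColBERT MaxSim): for each query-passage pair in the
    batch, sum over query tokens of the maximum similarity to any passage token.
    q_tok: [B, Lq, D] query token embeddings
    p_tok: [B, Lp, D] passage token embeddings
    returns: [B, B] score matrix (row i = query i vs. all in-batch passages)
    """
    # [B, B, Lq, Lp] token-level similarities for all query-passage pairs
    sim = torch.einsum("qld,pmd->qplm", q_tok, p_tok)
    return sim.max(dim=-1).values.sum(dim=-1)

def dot_product_scores(q_vec, p_vec):
    """Student scores: single-vector dot product, [B, B]."""
    return q_vec @ p_vec.t()

def tct_distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """KL divergence between the teacher's and student's distributions over
    all passages in the batch (in-batch negatives included)."""
    t = F.softmax(teacher_scores / temperature, dim=-1)
    s = F.log_softmax(student_scores / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean")
```

Because the teacher is itself a bi-encoder, its token embeddings for the whole batch come from a single forward pass, so scoring all B×B in-batch pairs stays cheap; a cross-encoder teacher would instead require a separate forward pass per query-passage pair.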