PROD: Progressive Distillation for Dense Retrieval

Knowledge distillation is an effective way to transfer knowledge from a strong teacher to an efficient student model. Ideally, the better the teacher, the better the student should perform. However, this expectation does not always hold: distilling from a strong teacher often produces a poor student because of the nonnegligible capacity gap between teacher and student. To bridge this gap, we propose PROD, a PROgressive Distillation method for dense retrieval. PROD consists of a teacher progressive distillation and a data progressive distillation that gradually improve the student. To alleviate catastrophic forgetting, we introduce a regularization term in each distillation stage. We conduct extensive experiments on seven datasets, including five widely used public benchmarks (MS MARCO Passage, TREC Passage 19, TREC Document 19, MS MARCO Document, and Natural Questions) and two industry datasets (Bing-Rel and Bing-Ads). PROD achieves state-of-the-art results among distillation methods for dense retrieval, and our 6-layer student model even surpasses most existing 12-layer models on all five public benchmarks. The code and models are released at https://github.com/microsoft/SimXNS.
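As a rough illustration of the idea described above (not the authors' exact formulation), the sketch below shows one stage of progressive distillation for a dense retriever in PyTorch: the student matches the current teacher's relevance distribution over candidate passages via KL divergence, while a second KL term toward the previous-stage student's outputs regularizes the update to limit catastrophic forgetting. The function name, loss form, and hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def progressive_distillation_loss(student_scores, teacher_scores, prev_student_scores,
                                  temperature=1.0, reg_weight=0.1):
    """Sketch of a single distillation stage (hypothetical formulation).

    All score tensors have shape [batch, num_candidates]: relevance scores of
    each query over its candidate passages (e.g., one positive plus sampled
    negatives).
    """
    s_log_probs = F.log_softmax(student_scores / temperature, dim=-1)

    # Main distillation term: student follows the current (stronger) teacher's
    # distribution over candidate passages.
    t_probs = F.softmax(teacher_scores / temperature, dim=-1)
    distill = F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature ** 2

    # Regularization toward the previous-stage student's outputs, intended to
    # alleviate catastrophic forgetting when moving to a stronger teacher.
    p_probs = F.softmax(prev_student_scores / temperature, dim=-1)
    reg = F.kl_div(s_log_probs, p_probs, reduction="batchmean") * temperature ** 2

    return distill + reg_weight * reg


if __name__ == "__main__":
    # Toy usage with random scores in place of real retriever/re-ranker outputs.
    batch, num_cands = 4, 8
    student = torch.randn(batch, num_cands, requires_grad=True)
    teacher = torch.randn(batch, num_cands)
    prev_student = torch.randn(batch, num_cands)
    loss = progressive_distillation_loss(student, teacher, prev_student)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

In the progressive setting, this loss would be applied repeatedly, swapping in a stronger teacher (and, per the abstract, progressively harder data) at each stage while the regularization anchor comes from the student checkpoint of the previous stage.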
