Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation

The query-time latency of neural ranking models depends largely on their architecture and on deliberate design choices that trade off effectiveness for higher efficiency. This focus on low query latency in a growing number of efficient ranking architectures makes them feasible for production deployment. In machine learning, an increasingly common way to close the effectiveness gap of more efficient models is knowledge distillation from a large teacher model to a smaller student model. We find that different ranking architectures tend to produce output scores of different magnitudes. Based on this finding, we propose a cross-architecture training procedure with a margin-focused loss (Margin-MSE) that adapts knowledge distillation to the varying score output distributions of different BERT and non-BERT ranking architectures. We apply the teacher's scores as additional fine-grained labels to existing training triples of the MSMARCO-Passage collection. We evaluate our procedure by distilling knowledge from state-of-the-art concatenated BERT models into four different efficient architectures (TK, ColBERT, PreTT, and a BERT CLS dot product model). We show that, across all evaluated architectures, our Margin-MSE knowledge distillation significantly improves effectiveness without compromising efficiency. To benefit the community, we publish the costly teacher-score training files in a ready-to-use package.
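The core idea of Margin-MSE is to match score margins rather than absolute scores: for each training triple, the student is trained so that its score difference between the relevant and the non-relevant passage matches the teacher's score difference on the same triple, which makes the signal usable across architectures with differently scaled score distributions. Below is a minimal PyTorch-style sketch of this reading of the loss; the function name, tensor shapes, and the toy scores are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn.functional as F

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    # Match the student's margin (relevant minus non-relevant score) to the
    # teacher's margin on the same query-passage triple, instead of matching
    # the absolute scores, so differently scaled score distributions are fine.
    return F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)

# Toy usage with a batch of three triples; in practice these would come from
# the student model at training time and from precomputed teacher-score files.
student_pos = torch.tensor([2.3, 0.8, 1.5])
student_neg = torch.tensor([1.1, 0.2, 1.4])
teacher_pos = torch.tensor([9.7, 4.1, 6.0])
teacher_neg = torch.tensor([7.9, 3.0, 5.8])
loss = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)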
