Simplified TinyBERT: Knowledge Distillation for Document Retrieval

Although BERT is highly effective for document ranking, its computational cost is substantial compared to other retrieval methods. To address this, this paper first empirically investigates the application of knowledge distillation models to the document ranking task. It then proposes two simplifications on top of the recent TinyBERT. Evaluation on the MS MARCO document re-ranking task confirms the effectiveness of the proposed simplifications.
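The abstract does not spell out the distillation objective, but TinyBERT-style re-rankers build on the response-based distillation of Hinton et al. [17], in which a compact student is trained to match a large teacher's softened output distribution alongside the gold labels. The PyTorch sketch below illustrates that combined loss for a binary-relevance re-ranker; the function name and the alpha/temperature values are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Hypothetical combined objective: KL divergence between the
    # temperature-softened teacher and student distributions (Hinton
    # et al. [17]) plus the usual cross-entropy on gold labels.
    soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, log_target=True,
                  reduction="batchmean") * (temperature ** 2)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Example: distill relevance scores for a batch of 4 query-document pairs.
teacher_logits = torch.randn(4, 2)                      # full BERT teacher scores
student_logits = torch.randn(4, 2, requires_grad=True)  # compact student scores
labels = torch.tensor([0, 1, 1, 0])                     # gold relevance labels
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()

In a re-ranking setup, the two sets of logits would come from scoring the same query-document pairs with the compact student and the full BERT teacher, respectively.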

[1] Jimmy J. Lin, et al. Simple Applications of BERT for Ad Hoc Document Retrieval, 2019, ArXiv.

[2] Yiming Yang, et al. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, ACL.

[3] Jamie Callan, et al. Deeper Text Understanding for IR with Contextual Neural Language Modeling, 2019, SIGIR.

[4] Jimmy J. Lin, et al. Document Ranking with a Pretrained Sequence-to-Sequence Model, 2020, Findings of EMNLP.

[5] Bhaskar Mitra, et al. Overview of the TREC 2019 deep learning track, 2020, ArXiv.

[6] Yu Cheng, et al. Patient Knowledge Distillation for BERT Model Compression, 2019, EMNLP.

[7] Jianfeng Gao, et al. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2018.

[8] Nazli Goharian, et al. CEDR: Contextualized Embeddings for Document Ranking, 2019, SIGIR.

[9] Kyunghyun Cho, et al. Passage Re-ranking with BERT, 2019, ArXiv.

[10] Jimmy J. Lin, et al. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks, 2019, ArXiv.

[11] Hao Wu, et al. Mixed Precision Training, 2017, ICLR.

[12] Qun Liu, et al. TinyBERT: Distilling BERT for Natural Language Understanding, 2020, EMNLP.

[13] Luyu Gao, et al. Understanding BERT Rankers Under Distillation, 2020, ICTIR.

[14] Jimmy J. Lin, et al. Applying BERT to Document Retrieval with Birch, 2019, EMNLP.

[15] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[16] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.

[17] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.