Distilling Task-Specific Knowledge from BERT via Adversarial Belief Matching

Large pre-trained language models such as BERT [1] have achieved strong results when fine-tuned on a variety of natural language tasks, but they are cumbersome to deploy. Applying knowledge distillation (KD) [2] to compress these pre-trained models for a specific downstream task is challenging because of the small amount of task-specific labeled data, which often leaves the compressed model performing poorly. Considerable effort has been devoted to improving the distillation process for BERT, using techniques such as leveraging intermediate hints [3], student pre-training [4], and data augmentation [5].
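For reference, the standard KD objective of [2] trains the student on a weighted combination of the usual cross-entropy with the task labels and a KL-divergence term that matches the student's temperature-softened output distribution to the teacher's. The sketch below illustrates this objective under common assumptions; the hyper-parameter names (`temperature`, `alpha`) and their default values are illustrative, not taken from any particular system described here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Standard KD objective [2]: weighted sum of a soft-target term
    (KL divergence between temperature-softened teacher and student
    distributions) and a hard-target cross-entropy with the labels."""
    # Soft-target term, scaled by T^2 so gradient magnitudes stay
    # comparable across temperatures (as in the original KD paper).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target term: ordinary cross-entropy with ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In the task-specific setting discussed above, the soft-target term is the one starved for data: with few labeled examples, the student sees too few teacher predictions to match the teacher's behavior reliably, which is what the hint-based, pre-training, and augmentation approaches [3, 4, 5] aim to mitigate.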