Distilling Task-Specific Knowledge from BERT via Adversarial Belief Matching

Large pre-trained language models such as BERT [1] have achieved strong results when fine-tuned on a variety of natural language tasks, but they are cumbersome to deploy. Applying knowledge distillation (KD) [2] to compress these pre-trained models for a specific downstream task is challenging because of the small amount of task-specific labeled data, which often leaves the compressed model performing poorly. Considerable effort has been devoted to improving the distillation process for BERT, using techniques such as leveraging intermediate hints [3], student pre-training [4], and data augmentation [5].
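For reference, the standard KD objective of [2] trains the student on a weighted combination of the usual cross-entropy with the task labels and a KL-divergence term that matches the student's temperature-softened output distribution to the teacher's. The sketch below illustrates this objective under common assumptions; the hyper-parameter names (`temperature`, `alpha`) and their default values are illustrative, not taken from any particular system described here.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Standard KD objective [2]: weighted sum of a soft-target term
    (KL divergence between temperature-softened teacher and student
    distributions) and a hard-target cross-entropy with the labels."""
    # Soft-target term, scaled by T^2 so gradient magnitudes stay
    # comparable across temperatures (as in the original KD paper).
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target term: ordinary cross-entropy with ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In the task-specific setting discussed above, the soft-target term is the one starved for data: with few labeled examples, the student sees too few teacher predictions to match the teacher's behavior reliably, which is what the hint-based, pre-training, and augmentation approaches [3, 4, 5] aim to mitigate.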