MixKD: Towards Efficient Distillation of Large-scale Language Models

Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of larger models, more power consumption, and slower inference, which hinder their applicability to low-resource (memory and computation) platforms. Knowledge distillation (KD) has been shown to be an effective framework for compressing such large models. However, large-scale neural networks are prone to memorizing training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when limited task-specific data are available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on linear interpolations of example pairs as well. We prove, from a theoretical perspective, that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over standard KD training and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
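
To make the core idea concrete, the following Python/PyTorch sketch shows one way the mixup-augmented distillation term could be computed. It is a minimal illustration rather than the authors' exact implementation: the student, teacher, emb_a, and emb_b names are hypothetical placeholders, and for text the interpolation is assumed to be performed on input embeddings rather than raw tokens.

import torch
import torch.nn.functional as F

def mixkd_term(student, teacher, emb_a, emb_b, alpha=0.4, temperature=1.0):
    # Illustrative sketch only: 'student' and 'teacher' are assumed to be
    # callables mapping input embeddings to classification logits.

    # Sample the mixup coefficient from a Beta(alpha, alpha) distribution
    # (Zhang et al., mixup, 2017); alpha is a hyperparameter.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()

    # Linearly interpolate the pair of examples (in embedding space for text).
    emb_mix = lam * emb_a + (1.0 - lam) * emb_b

    # The teacher provides soft targets on the interpolated example ...
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(emb_mix) / temperature, dim=-1)

    # ... and the student is trained to match them via a KL-divergence
    # (soft cross-entropy) loss, scaled by temperature^2 as in standard KD.
    student_log_probs = F.log_softmax(student(emb_mix) / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

In the full training objective, a term of this form would typically be added to the usual supervised loss and the standard distillation loss on the original (un-mixed) examples, consistent with the abstract's statement that interpolated pairs are used in addition to the original training data.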
