Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher

With the ever-growing scale of neural models, knowledge distillation (KD) has attracted increasing attention as a prominent tool for neural model compression. However, there are counter-intuitive observations in the literature revealing some challenging limitations of KD. A case in point is that the best-performing checkpoint of the teacher is not necessarily the best teacher for training the student in KD. Therefore, an important question is how to find the best teacher checkpoint for distillation. Exhaustively searching through the teacher's checkpoints is tedious and computationally expensive; we refer to this as the checkpoint-search problem. Another observation is that a larger teacher is not necessarily a better teacher in KD, which is referred to as the capacity-gap problem. To address these challenging problems, in this work we introduce our progressive knowledge distillation (Pro-KD) technique, which defines a smoother training path for the student by following the training footprints of the teacher rather than distilling solely from a single mature, fully trained teacher. We demonstrate that our technique is effective in mitigating both the capacity-gap problem and the checkpoint-search problem. We evaluate our technique through a comprehensive set of experiments on image classification (CIFAR-10 and CIFAR-100), natural language understanding tasks from the GLUE benchmark, and question answering (SQuAD 1.1 and 2.0) with BERT-based models, and consistently obtain superior results over state-of-the-art techniques.
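
The sketch below illustrates the general idea of progressive distillation as described above: the student is trained against a sequence of teacher checkpoints, from early to fully trained, rather than only against the final teacher. This is a minimal, self-contained illustration, not the authors' exact method; the checkpoint schedule, temperature, loss weighting, and toy models are assumptions made for the example.

```python
# Minimal sketch of progressive (checkpoint-following) distillation.
# The teacher checkpoints, schedule, and hyper-parameters here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a softened KL term (scaled by T^2) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


# Toy teacher/student and random data; in practice these would be a large
# pre-trained teacher (e.g. BERT) and a smaller student on real task data.
teacher = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 10))
student = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 10))
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))

# Stand-ins for snapshots of the teacher saved at increasing training maturity,
# ending with the fully trained teacher.
teacher_checkpoints = [
    {k: v + 0.1 * torch.randn_like(v) for k, v in teacher.state_dict().items()}
    for _ in range(3)
] + [teacher.state_dict()]

optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
epochs_per_stage = 2  # assumed schedule: equal training budget per checkpoint

for stage, ckpt in enumerate(teacher_checkpoints):
    teacher.load_state_dict(ckpt)  # follow the teacher's training footprint
    teacher.eval()
    for _ in range(epochs_per_stage):
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = student(x)
        loss = kd_loss(s_logits, t_logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"stage {stage}: loss {loss.item():.4f}")
```

In this sketch each stage simply gets an equal number of epochs; the paper's contribution lies in how the student tracks the teacher's training trajectory, so the staging and any annealing of the soft-target term should be taken from the paper itself rather than from this example.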
