Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation

Knowledge distillation usually transfers the knowledge from a pre-trained cumbersome teacher network to a compact student network, following the classical teacher-teaching-student paradigm. Based on this paradigm, previous methods mostly focus on how to efficiently train a better student network for deployment. Departing from existing practice, in this paper we propose a novel student-helping-teacher formulation, Teacher Evolution via Self-Knowledge Distillation (TESKD), in which the target teacher (for deployment) is learned with the help of multiple hierarchical students that share its structural backbone. The diverse feedback from the multiple students allows the teacher to improve itself through the shared feature representations. The effectiveness of the proposed framework is demonstrated by extensive experiments with various network settings on two standard benchmarks, CIFAR-100 and ImageNet. Notably, when trained with our method, ResNet-18 achieves 79.15% and 71.14% accuracy on CIFAR-100 and ImageNet, outperforming the baseline results by 4.74% and 1.43%, respectively. The code is available at: https://github.com/zhengli427/TESKD.
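To make the student-helping-teacher idea concrete, the sketch below shows, in PyTorch, how lightweight student heads attached to intermediate features of a shared backbone can feed distillation losses back into the teacher. The backbone, the branch design, the loss weights, and names such as `TinyBackbone`, `StudentBranch`, and `teskd_style_loss` are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal sketch of student-helping-teacher training (assumed setup, not the
# official TESKD code): a shared backbone is the deployed teacher, and several
# hierarchical student heads provide feedback through distillation losses.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyBackbone(nn.Module):
    """Stand-in teacher backbone with three hierarchical stages."""
    def __init__(self, num_classes=100):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        logits = self.fc(self.pool(f3).flatten(1))
        return [f1, f2, f3], logits


class StudentBranch(nn.Module):
    """Lightweight head attached to an intermediate teacher feature map."""
    def __init__(self, in_ch, num_classes=100):
        super().__init__()
        self.adapt = nn.Sequential(nn.Conv2d(in_ch, 64, 1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, feat):
        h = self.adapt(feat)
        return self.fc(self.pool(h).flatten(1))


def teskd_style_loss(teacher_logits, student_logits_list, targets, T=4.0, alpha=0.5):
    """Cross-entropy on every head plus soft-label KL between each student and
    the teacher. Gradients from the student heads flow into the shared backbone,
    which is how the deployed teacher benefits from its students."""
    loss = F.cross_entropy(teacher_logits, targets)
    for s_logits in student_logits_list:
        loss = loss + F.cross_entropy(s_logits, targets)
        loss = loss + alpha * (T * T) * F.kl_div(
            F.log_softmax(s_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction="batchmean",
        )
    return loss


if __name__ == "__main__":
    backbone = TinyBackbone()
    branches = nn.ModuleList([StudentBranch(c) for c in (16, 32)])
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 100, (8,))
    feats, t_logits = backbone(x)
    s_logits = [branch(f) for branch, f in zip(branches, feats[:2])]
    loss = teskd_style_loss(t_logits, s_logits, y)
    loss.backward()  # feedback from the student heads updates the shared backbone
    print(loss.item())
```

At test time only the backbone (the teacher) is kept; the student branches exist solely to regularize training, which is why the deployment cost is unchanged while accuracy improves.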
