Memory-Replay Knowledge Distillation

Knowledge Distillation (KD), which transfers knowledge from a teacher to a student network by penalizing their Kullback–Leibler (KL) divergence, is a widely used tool for Deep Neural Network (DNN) compression in intelligent sensor systems. Traditional KD relies on a pre-trained teacher, whereas self-KD lets a network distill its own knowledge to achieve better performance. The role of the teacher in self-KD is usually played by multi-branch peers or by the same sample under different augmentations. However, these self-KD methods have limitations that hinder widespread use: the former requires redesigning the DNN for each task, and the latter depends on the effectiveness of the augmentation method. To avoid these limitations, we propose a new self-KD method, Memory-replay Knowledge Distillation (MrKD), which uses historical models as teachers. First, we propose a novel self-KD training scheme that penalizes the KD loss between the current model's output distributions and the outputs of its backup copies along the training trajectory. This strategy regularizes the model with its historical output distributions and stabilizes learning. Second, a simple Fully Connected Network (FCN) is applied to ensemble the historical teachers' outputs for better guidance. Finally, to ensure the teacher outputs rank the ground-truth class first, we correct the teacher logits with the Knowledge Adjustment (KA) method. Experiments on image (CIFAR-100, CIFAR-10, and CINIC-10) and audio (DCASE) classification tasks show that MrKD improves single-model training and works efficiently across different datasets. In contrast to existing self-KD methods that rely on various forms of external knowledge, the effectiveness of MrKD sheds light on the historical models along the training trajectory, which are usually discarded.
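
The core idea can be illustrated with a minimal sketch: the current model is regularized toward the Knowledge-Adjusted, FCN-ensembled outputs of its own earlier snapshots. The code below is an assumption-laden PyTorch sketch, not the authors' reference implementation; the class name MrKDLoss, the hyperparameter values, and the snapshot schedule in the usage comments are all illustrative.

```python
import copy  # needed only for the snapshot usage sketch at the bottom
import torch
import torch.nn as nn
import torch.nn.functional as F


class MrKDLoss(nn.Module):
    """Cross-entropy plus a KL term toward ensembled historical snapshots (sketch)."""

    def __init__(self, num_classes, num_teachers, temperature=4.0, alpha=0.5):
        super().__init__()
        self.T = temperature      # distillation temperature (assumed value)
        self.alpha = alpha        # weight of the distillation term (assumed value)
        # Simple fully connected network that fuses the stacked teacher logits.
        self.ensembler = nn.Sequential(
            nn.Linear(num_classes * num_teachers, num_classes),
            nn.ReLU(),
            nn.Linear(num_classes, num_classes),
        )

    @staticmethod
    def knowledge_adjustment(logits, targets):
        """Swap the largest logit with the ground-truth logit so the teacher
        always ranks the correct class first (Knowledge Adjustment)."""
        adjusted = logits.clone()
        max_val, max_idx = adjusted.max(dim=1)
        gt_val = adjusted.gather(1, targets.unsqueeze(1)).squeeze(1)
        adjusted.scatter_(1, targets.unsqueeze(1), max_val.unsqueeze(1))
        adjusted.scatter_(1, max_idx.unsqueeze(1), gt_val.unsqueeze(1))
        return adjusted

    def forward(self, student_logits, teacher_logits_list, targets):
        # teacher_logits_list: logits from historical snapshots, detached from the graph.
        adjusted = [self.knowledge_adjustment(t.detach(), targets)
                    for t in teacher_logits_list]
        fused = self.ensembler(torch.cat(adjusted, dim=1))
        ce = F.cross_entropy(student_logits, targets)
        kd = F.kl_div(F.log_softmax(student_logits / self.T, dim=1),
                      F.softmax(fused / self.T, dim=1),
                      reduction="batchmean") * self.T ** 2
        return (1.0 - self.alpha) * ce + self.alpha * kd


# Usage sketch: keep a few periodic snapshots of the model as memory-replay teachers.
# model, loader, optimizer, K, and num_classes are assumed to be defined elsewhere.
# teachers = [copy.deepcopy(model).eval() for _ in range(K)]  # refreshed every few epochs
# criterion = MrKDLoss(num_classes, num_teachers=K)
# for x, y in loader:
#     with torch.no_grad():
#         t_logits = [t(x) for t in teachers]
#     loss = criterion(model(x), t_logits, y)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```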
