SMKD: Selective Mutual Knowledge Distillation

Mutual knowledge distillation (MKD) transfers knowledge among multiple models trained collaboratively. However, not all of this knowledge is accurate or reliable, particularly under challenging conditions such as label noise, where models memorize corrupted information. The problem can be addressed both by improving the reliability of the knowledge source and by selecting only reliable knowledge for distillation. While making models more reliable is a widely studied topic, selective MKD has received far less attention. To address this gap, we propose selective mutual knowledge distillation (SMKD), whose key component is a generic knowledge selection formulation that admits either static or progressive selection thresholds. SMKD also covers two special cases, distilling no knowledge and distilling all knowledge, yielding a unified MKD framework. Extensive experimental results demonstrate the effectiveness of SMKD and justify its design.
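To make the idea concrete, below is a minimal sketch (in PyTorch) of how a selective MKD loss with a static or progressive threshold could look. The confidence-based selection rule and all names (tau, progressive_tau, smkd_loss, alpha, T) are illustrative assumptions, not the paper's exact formulation.

# Hypothetical sketch of a selective mutual knowledge distillation loss.
# The selection rule (peer confidence >= tau) is an assumption for illustration.
import torch
import torch.nn.functional as F


def progressive_tau(epoch: int, total_epochs: int,
                    tau_min: float = 0.0, tau_max: float = 0.9) -> float:
    """Progressive selection threshold that tightens as training proceeds."""
    return tau_min + (tau_max - tau_min) * epoch / max(total_epochs - 1, 1)


def smkd_loss(logits_a, logits_b, labels, tau: float,
              T: float = 2.0, alpha: float = 1.0):
    """Loss for model A: supervised CE plus distillation from peer B on selected samples.

    tau = 0 distills from every sample (plain MKD, "all knowledge");
    tau = 1 effectively disables distillation ("no knowledge"),
    so both extremes appear as special cases of one formulation.
    """
    ce = F.cross_entropy(logits_a, labels)

    with torch.no_grad():
        peer_prob = F.softmax(logits_b / T, dim=1)
        peer_conf = peer_prob.max(dim=1).values       # peer confidence per sample
        mask = (peer_conf >= tau).float()             # keep only "reliable" peer knowledge

    log_p_a = F.log_softmax(logits_a / T, dim=1)
    kd_per_sample = F.kl_div(log_p_a, peer_prob, reduction="none").sum(dim=1)
    kd = (kd_per_sample * mask).sum() / mask.sum().clamp(min=1.0)

    return ce + alpha * (T ** 2) * kd

In a two-model setup, each model would call smkd_loss with the other's logits (swapping logits_a and logits_b), and tau would either stay fixed (static) or be set per epoch via progressive_tau (progressive).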
