Improved Knowledge Distillation via Adversarial Collaboration

Knowledge distillation has become an important approach to obtaining a compact yet effective model. To achieve this goal, a small student model is trained to exploit the knowledge of a large, well-trained teacher model. However, due to the capacity gap between the teacher and the student, it is difficult for the student's performance to reach the teacher's level. To address this issue, existing methods propose to reduce the difficulty of the teacher's knowledge through a proxy. We argue that these proxy-based methods overlook the loss of the teacher's knowledge, which may cause the student to encounter capacity bottlenecks. In this paper, we alleviate the capacity gap problem from a new perspective, with the aim of averting knowledge loss. Instead of sacrificing part of the teacher's knowledge, we propose to build a more powerful student via adversarial collaborative learning. To this end, we propose an Adversarial Collaborative Knowledge Distillation (ACKD) method that effectively improves the performance of knowledge distillation. Specifically, we construct the student model with multiple auxiliary learners. Meanwhile, we devise an adversarial collaborative module (ACM) that introduces an attention mechanism and adversarial learning to enhance the capacity of the student. Extensive experiments on four classification tasks show the superiority of the proposed ACKD.
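
To make the described setup concrete, below is a minimal sketch of the adversarial collaborative idea in PyTorch. It is only an illustration under assumed details: the auxiliary-learner heads, the attention-based fusion, the logit-level discriminator, and the equal loss weighting are all hypothetical choices, since the abstract does not specify the actual ACM architecture or losses.

```python
# Hypothetical sketch of ACKD-style training components (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AuxiliaryLearner(nn.Module):
    """Lightweight classifier head attached to an intermediate student feature map."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feat):
        return self.fc(self.pool(feat).flatten(1))


class AttentionFusion(nn.Module):
    """Fuses the logits of several learners with learned attention weights."""
    def __init__(self, num_learners):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_learners))

    def forward(self, logits_list):
        weights = F.softmax(self.scores, dim=0)            # (num_learners,)
        stacked = torch.stack(logits_list, dim=0)          # (num_learners, B, C)
        return (weights.view(-1, 1, 1) * stacked).sum(0)   # (B, C)


class LogitDiscriminator(nn.Module):
    """Adversary that tries to distinguish teacher logits from fused student logits."""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, 128), nn.ReLU(inplace=True), nn.Linear(128, 1))

    def forward(self, logits):
        return self.net(logits)


def student_loss(teacher_logits, student_logits_list, labels,
                 fusion, discriminator, T=4.0):
    """Illustrative student objective: cross-entropy + soft-label KD for every
    learner and the fused prediction, plus an adversarial term that pushes the
    fused student logits toward being indistinguishable from the teacher's."""
    fused = fusion(student_logits_list)
    heads = student_logits_list + [fused]
    ce = sum(F.cross_entropy(l, labels) for l in heads)
    kd = sum(F.kl_div(F.log_softmax(l / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1),
                      reduction="batchmean") * T * T
             for l in heads)
    adv = F.binary_cross_entropy_with_logits(
        discriminator(fused),
        torch.ones(fused.size(0), 1, device=fused.device))
    return ce + kd + adv  # relative loss weights are omitted / assumed equal
```

In this sketch the discriminator would be trained in alternation with the student (real = teacher logits, fake = fused student logits), which is the standard adversarial-learning recipe; the collaborative aspect comes from the auxiliary learners sharing the backbone and being fused through the attention weights.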
