Feature Fusion for Online Mutual Knowledge Distillation

We propose a learning framework named Feature Fusion Learning (FFL) that efficiently trains a powerful classifier through a fusion module combining the feature maps generated by parallel neural networks. Specifically, we train several parallel neural networks as sub-networks and combine the feature maps from each sub-network with a fusion module to create a more meaningful feature map. The fused feature map is then passed to a fused classifier for overall classification. Unlike existing feature fusion methods, in our framework an ensemble of sub-network classifiers transfers its knowledge to the fused classifier, and the fused classifier in turn delivers its knowledge back to each sub-network, so the two sides teach each other in an online knowledge distillation manner. This mutual teaching scheme not only improves the performance of the fused classifier but also yields a performance gain in each sub-network. Moreover, our framework is more flexible because different network architectures can be used for the sub-networks. Experiments on multiple datasets, including CIFAR-10, CIFAR-100, and ImageNet, show that our method outperforms alternative methods in terms of the performance of both the sub-networks and the fused classifier.
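Below is a minimal PyTorch sketch of the training objective described above: each sub-network is assumed to produce a feature map and logits, a fusion module combines the feature maps and produces fused logits, and the ensemble of sub-network outputs and the fused classifier distill knowledge into each other. The concatenation-plus-1x1-convolution fusion, the temperature value, the equal loss weights, and the detached teacher are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

def kl_soft(student_logits, teacher_logits, T=3.0):
    # KL divergence between temperature-softened distributions.
    # Detaching the teacher is an assumption; the paper's exact gradient
    # flow may differ.
    p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits.detach() / T, dim=1)
    return F.kl_div(p_s, p_t, reduction="batchmean") * (T * T)

class FusionModule(nn.Module):
    # Combines feature maps from parallel sub-networks into one fused map
    # and classifies it (concatenation + 1x1 conv chosen for illustration).
    def __init__(self, in_channels, num_subnets, num_classes):
        super().__init__()
        self.fuse = nn.Conv2d(in_channels * num_subnets, in_channels, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feature_maps):
        fused = F.relu(self.fuse(torch.cat(feature_maps, dim=1)))
        return self.fc(self.pool(fused).flatten(1))

def ffl_loss(sub_logits, fused_logits, targets):
    # Cross-entropy for every classifier plus bidirectional distillation:
    # the sub-network ensemble teaches the fused classifier, and the fused
    # classifier teaches each sub-network.
    ce = F.cross_entropy(fused_logits, targets)
    ce = ce + sum(F.cross_entropy(l, targets) for l in sub_logits)
    ensemble_logits = torch.stack(sub_logits).mean(dim=0)
    distill = kl_soft(fused_logits, ensemble_logits)          # ensemble -> fused
    distill = distill + sum(kl_soft(l, fused_logits) for l in sub_logits)  # fused -> subs
    return ce + distill

# Hypothetical usage with two sub-networks that each return (feature_map, logits):
#   feats, logits = zip(*(net(x) for net in sub_nets))
#   fused_logits = fusion(list(feats))
#   loss = ffl_loss(list(logits), fused_logits, targets)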
