Feature-Level Ensemble Knowledge Distillation for Aggregating Knowledge from Multiple Networks

Knowledge Distillation (KD) transfers knowledge within a teacher-student framework by providing the predictions of the teacher network to the student network during training, helping the student generalize better. The teacher can be either a single high-capacity network or an ensemble of multiple networks; however, the ensemble setting is inconvenient when one wants to use feature-map-based distillation methods. In this paper, we empirically show that using several non-linear transformation layers copes well with the multiple-teacher setting compared to other kinds of feature-map-level distillation methods. Building on this observation, we propose a versatile and powerful training scheme named FEature-level Ensemble knowledge Distillation (FEED), a pair of training algorithms that transfer the ensemble knowledge of multiple teacher networks to the student at the feature-map level. Among feature-map-level distillation methods, applying several non-linear transformations in parallel to transfer the knowledge of the multiple teachers helps the student find more generalized solutions. We name this method parallel FEED; experimental results on CIFAR-100 and ImageNet show clear performance improvements without introducing any additional parameters or computations at test time. We also report experimental results for sequentially feeding the teachers' information to the student, hence the name sequential FEED, and discuss the lessons learned. Additionally, empirical measurements of the reconstruction errors at the feature maps give hints about the source of these enhancements.
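To make the parallel scheme concrete, the sketch below (PyTorch) shows one plausible realization of feature-map-level distillation from multiple teachers: the student's feature map is passed through one independent non-linear transformation per teacher, and each transformed output is matched against the corresponding teacher's feature map. The module names, the exact transformation architecture, and the L1 matching loss are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NonLinearTransform(nn.Module):
    """Small conv stack mapping the student's feature map toward one teacher's
    feature space (the exact layer design here is an assumption)."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)


class ParallelFeatureDistillLoss(nn.Module):
    """Parallel feature-level ensemble distillation: one non-linear transform per
    teacher, each reconstructing that teacher's feature map from the shared
    student feature map. The transforms are discarded at test time, so the
    student incurs no extra parameters or computation at inference."""

    def __init__(self, student_channels: int, teacher_channels: list[int]):
        super().__init__()
        self.transforms = nn.ModuleList(
            NonLinearTransform(student_channels, c) for c in teacher_channels
        )

    def forward(self, student_feat: torch.Tensor,
                teacher_feats: list[torch.Tensor]) -> torch.Tensor:
        loss = student_feat.new_zeros(())
        for transform, t_feat in zip(self.transforms, teacher_feats):
            # Match transformed student features to the frozen teacher's features.
            loss = loss + F.l1_loss(transform(student_feat), t_feat.detach())
        return loss
```

In training, this distillation term would be added to the usual cross-entropy loss with a weighting hyperparameter (e.g., `total_loss = ce_loss + beta * distill_loss`), where `beta` is an assumed name for the balancing coefficient.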
