Knowledge Distillation from Few Samples

Current knowledge distillation methods require the full training data to distill knowledge from a large "teacher" network to a compact "student" network by matching certain statistics between the two, such as softmax outputs and feature responses. This is not only time-consuming but also inconsistent with human cognition, in which children can learn from adults with only a few examples. This paper proposes a novel and simple method for knowledge distillation from few samples. Under the assumption that the "teacher" and "student" have the same feature-map sizes at each corresponding block, we add a 1x1 conv-layer at the end of each block in the student-net and align the block-level outputs between "teacher" and "student" by estimating the parameters of the added layer from the limited samples. We prove that the added layer can be absorbed/merged into the previous conv-layer to form a new conv-layer with the same number of parameters and computation cost as the original one. Experiments verify that the proposed method is efficient and effective at distilling knowledge from teacher-nets to student-nets constructed in different ways on various datasets.
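
The key observation is linear-algebraic: because the added 1x1 conv-layer and the conv-layer preceding it are both linear maps, their composition is again a single conv-layer of the same shape. The sketch below illustrates the two steps on NumPy arrays, under the assumption that the alignment is estimated by least squares on the few samples and that no nonlinearity sits between the previous conv-layer and the added 1x1 layer; the function names (`estimate_alignment`, `merge_into_prev_conv`) are illustrative and not taken from the paper's code.

```python
import numpy as np

def estimate_alignment(student_feat, teacher_feat):
    """Least-squares estimate of the 1x1 conv (a C x C matrix Q) that maps
    student block outputs onto teacher block outputs.

    student_feat, teacher_feat: arrays of shape (N, C, H, W) with matching C,
    computed on the same few samples.
    """
    n, c, h, w = student_feat.shape
    S = student_feat.transpose(1, 0, 2, 3).reshape(c, -1)   # (C, N*H*W)
    T = teacher_feat.transpose(1, 0, 2, 3).reshape(c, -1)   # (C, N*H*W)
    # Solve min_Q ||Q S - T||_F by rewriting it as S^T Q^T ~= T^T.
    Q_t, *_ = np.linalg.lstsq(S.T, T.T, rcond=None)
    return Q_t.T                                             # (C, C)

def merge_into_prev_conv(Q, W, b=None):
    """Absorb the estimated 1x1 conv Q into the preceding conv-layer.

    W: previous conv weight of shape (C_out, C_in, k, k); b: optional bias (C_out,).
    Returns a merged weight (and bias) with exactly the same shape, so the
    student keeps its original parameter count and computation cost.
    """
    W_merged = np.einsum('oc,cikl->oikl', Q, W)              # compose the two linear maps
    b_merged = None if b is None else Q @ b
    return W_merged, b_merged
```

At inference time the merged weight simply replaces the original one, so nothing is added to the student-net. In the few-sample setting, this estimate-then-merge step would be applied block by block, feeding the block-level teacher and student feature maps into `estimate_alignment`.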
