Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students

We focus on the problem of training a deep neural network in generations. The pipeline is as follows: to optimize the target network (the student), another network (the teacher) with the same architecture is trained first and then used to provide part of the supervision signal in the next stage. While this strategy yields higher accuracy, many aspects (e.g., why teacher-student optimization helps) still require further exploration. This paper studies the problem from the perspective of controlling the strictness with which the teacher network is trained. Existing approaches mostly use a hard distribution (e.g., one-hot vectors) during training, producing a strict teacher that itself achieves high accuracy; we argue instead that the teacher should be more tolerant, even though this often implies lower teacher accuracy. The implementation is simple: an extra loss term is added to the teacher network, encouraging a few secondary classes to emerge and complement the primary class. Consequently, the teacher provides a milder supervision signal (a less peaked distribution), which allows the student to learn inter-class similarity and potentially lowers the risk of over-fitting. Experiments are performed on standard image classification benchmarks (CIFAR100 and ILSVRC2012). Although the teacher network is less powerful, the students show persistent accuracy growth over generations and eventually achieve higher classification accuracies than other competitors. Model ensemble and transfer feature extraction further verify the effectiveness of our approach.
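
Below is a minimal PyTorch sketch of this two-stage scheme. The specific form of the tolerant teacher loss (a confidence penalty realized as a negative-entropy term with weight `beta`), along with the temperature `T` and mixing weight `alpha`, are illustrative assumptions rather than the paper's exact formulation; the student loss follows the standard distillation recipe of Hinton et al. (2015), mixing hard-label cross-entropy with a KL term against the temperature-softened teacher distribution.

```python
import torch
import torch.nn.functional as F

# --- Teacher stage: cross-entropy plus a "tolerance" term. ---
# The negative-entropy penalty below is an illustrative stand-in for the
# paper's extra loss term; minimizing it encourages higher-entropy (less
# peaked) outputs, so a few secondary classes keep non-trivial probability.
def tolerant_teacher_loss(logits, labels, beta=0.1):
    ce = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=1)
    log_probs = F.log_softmax(logits, dim=1)
    neg_entropy = (probs * log_probs).sum(dim=1).mean()  # equals -H(p)
    return ce + beta * neg_entropy

# --- Student stage: standard distillation against the tolerant teacher. ---
# Mixes hard-label cross-entropy with KL divergence to the teacher's
# temperature-softened distribution; the T*T factor restores gradient scale.
def student_distillation_loss(student_logits, teacher_logits, labels,
                              T=2.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    return (1 - alpha) * hard + alpha * soft
```

In the generational setup, the student trained with the second loss becomes the teacher for the next generation, and the same two-stage recipe is repeated.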
