An Efficient Method of Training Small Models for Regression Problems with Knowledge Distillation

Compressing deep neural network (DNN) models has become an important and necessary technique for real-world applications, such as deploying those models on mobile devices. Knowledge distillation is one of the most popular methods for model compression, and many studies have been devoted to developing this technique. However, those studies have mainly focused on classification problems, and very few attempts have been made on regression problems, even though DNNs have many applications to regression tasks. In this paper, we propose a new formalism of knowledge distillation for regression problems. First, we propose a new loss function, the teacher outlier rejection loss, which rejects outliers in the training samples using the teacher model's predictions. Second, we consider a multi-task network with two outputs: one estimates the training labels, which are in general contaminated by annotation noise, and the other estimates the teacher model's output, which is expected to correct the noisy labels following the memorization effect. With this multi-task network, training of the student model's feature extractor becomes more effective, and it allows us to obtain a better student model than one trained from scratch. We performed a comprehensive evaluation with a simple toy model (a sinusoidal function) and two open datasets, MPIIGaze and Multi-PIE. Our results show consistent improvement in accuracy regardless of the annotation error level in the datasets.
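To make the two ideas above concrete, the following is a minimal PyTorch sketch of a two-headed student network and a thresholded outlier-rejection loss. It is an illustrative assumption only: the names TwoHeadStudent, teacher_outlier_rejection_loss, and threshold are hypothetical, the teacher here is a stand-in module, and the exact form of the teacher outlier rejection loss proposed in the paper may differ (e.g., in how outliers are detected and down-weighted).

import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoHeadStudent(nn.Module):
    # Shared feature extractor with two regression heads:
    # one head estimates the (possibly noisy) training labels,
    # the other head estimates the teacher model's prediction.
    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.label_head = nn.Linear(hidden_dim, 1)
        self.teacher_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.label_head(h), self.teacher_head(h)


def teacher_outlier_rejection_loss(pred, label, teacher_pred, threshold=1.0):
    # Assumed thresholded variant: a sample whose label deviates from the
    # teacher's prediction by more than `threshold` is treated as an outlier
    # and excluded from the label-regression loss.
    inlier = (label - teacher_pred).abs() < threshold
    sq_err = (pred - label) ** 2
    if inlier.any():
        return sq_err[inlier].mean()
    return sq_err.new_zeros(())  # all samples rejected: contribute nothing


# Example training step on synthetic data (sin(x) with noisy labels).
student = TwoHeadStudent(in_dim=1)
teacher = nn.Linear(1, 1)  # stand-in for a pre-trained teacher model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.rand(128, 1) * 6.28
y_noisy = torch.sin(x) + 0.3 * torch.randn_like(x)

with torch.no_grad():
    y_teacher = teacher(x)

y_label_pred, y_teacher_pred = student(x)
loss = (teacher_outlier_rejection_loss(y_label_pred, y_noisy, y_teacher)
        + F.mse_loss(y_teacher_pred, y_teacher))
loss.backward()
optimizer.step()

Under these assumptions, the label head is supervised only on samples the teacher deems plausible, while the teacher head regresses the teacher's output on every sample, so the shared backbone is still trained on the full batch.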

[1] Bo Chen, et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, 2017, arXiv.

[2] Yoshua Bengio, et al. A Closer Look at Memorization in Deep Networks, 2017, ICML.

[3] Zhi-Hua Zhou, et al. Automatic Age Estimation Based on Facial Aging Patterns, 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4] Xiangyu Zhang, et al. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5] Nassir Navab, et al. Robust Optimization for Deep Regression, 2015 IEEE International Conference on Computer Vision (ICCV).

[6] B. Ripley, et al. Robust Statistics, 2018, Encyclopedia of Mathematical Geosciences.

[7] Yoshua Bengio, et al. FitNets: Hints for Thin Deep Nets, 2014, ICLR.

[8] Ran El-Yaniv, et al. Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations, 2016, J. Mach. Learn. Res.

[9] Nikos Komodakis, et al. Wide Residual Networks, 2016, BMVC.

[10] Josef Kittler, et al. Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11] Radford M. Neal. Pattern Recognition and Machine Learning, 2007, Technometrics.

[12] Abhinav Gupta, et al. Learning from Noisy Large-Scale Datasets with Minimal Supervision, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Song Han, et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.

[14] Mario Fritz, et al. MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] Yale Song, et al. Learning from Noisy Labels with Distillation, 2017 IEEE International Conference on Computer Vision (ICCV).

[16] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[17] Nitish Srivastava, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014, J. Mach. Learn. Res.

[18] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, arXiv.

[19] Takeo Kanade, et al. Multi-PIE, 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition.

[20] Rich Caruana, et al. Do Deep Nets Really Need to be Deep?, 2013, NIPS.

[21] Samy Bengio, et al. Understanding Deep Learning Requires Rethinking Generalization, 2016, ICLR.

[22] Xingrui Yu, et al. Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels, 2018, NeurIPS.

[23] Tony X. Han, et al. Learning Efficient Object Detection Models with Knowledge Distillation, 2017, NIPS.

[24] Yaser Sheikh, et al. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields, 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25] Rich Caruana, et al. Model Compression, 2006, KDD '06.

[26] Tara N. Sainath, et al. Improving Deep Neural Networks for LVCSR Using Rectified Linear Units and Dropout, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[27] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.