When Does Label Smoothing Help?

The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident, and label smoothing has been used in many state-of-the-art models, including those for image classification, language translation, and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that, in addition to improving generalization, label smoothing improves model calibration, which can significantly benefit beam search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in a loss of information in the logits about resemblances between instances of different classes; that information is necessary for distillation, but its loss does not hurt generalization or calibration of the model's predictions.
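A minimal sketch of the soft-target construction described above, assuming a smoothing weight alpha (here 0.1) and K classes; the helper names `smooth_labels` and `cross_entropy` are illustrative, not taken from the paper:

```python
import numpy as np

def smooth_labels(hard_targets, alpha=0.1):
    """Mix one-hot targets with the uniform distribution over K classes.

    hard_targets: (batch, K) one-hot array; alpha: smoothing weight.
    Returns soft targets y_ls = (1 - alpha) * y_hot + alpha / K.
    """
    k = hard_targets.shape[-1]
    return (1.0 - alpha) * hard_targets + alpha / k

def cross_entropy(logits, targets):
    """Cross-entropy between (possibly smoothed) targets and softmax(logits)."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -(targets * log_probs).sum(axis=-1).mean()

# Example: 3 classes, one training example of class 0, alpha = 0.1.
y_hot = np.array([[1.0, 0.0, 0.0]])
y_ls = smooth_labels(y_hot, alpha=0.1)   # -> [[0.9333, 0.0333, 0.0333]]
logits = np.array([[2.0, 0.5, -1.0]])
print(cross_entropy(logits, y_ls))       # loss against the smoothed targets
```

Training against `y_ls` instead of `y_hot` keeps a small amount of probability mass on every incorrect class, which is what discourages the over-confident logits discussed above.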
