An Investigation of how Label Smoothing Affects Generalization

It has been hypothesized that label smoothing can reduce overfitting and improve generalization, and current empirical evidence seems to corroborate these effects. However, there is a lack of mathematical understanding of when and why such empirical improvements occur. In this paper, as a step towards understanding why label smoothing is effective, we propose a theoretical framework to show what benefits label smoothing provides in controlling the generalization loss. In particular, we show that this benefit can be precisely formulated and identified in the label noise setting, where the training data is partially mislabeled. Our theory also predicts the existence of an optimal label smoothing point, a single value for the label smoothing hyperparameter that minimizes the generalization loss. Extensive experiments confirm the predictions of our theory. We believe that our findings will help both theoreticians and practitioners understand label smoothing, and better apply it to real-world datasets.
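To make the mechanism concrete: label smoothing replaces a one-hot target with a mixture of the one-hot vector and the uniform distribution over the K classes, governed by a smoothing hyperparameter alpha. The sketch below illustrates this standard formulation; the function name smooth_labels is ours for illustration and is not taken from the paper.

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, alpha: float) -> np.ndarray:
    """Mix a one-hot target with the uniform distribution over K classes.

    With smoothing level alpha, the true class receives probability
    1 - alpha + alpha / K and every other class receives alpha / K.
    """
    num_classes = one_hot.shape[-1]
    return (1.0 - alpha) * one_hot + alpha / num_classes

# Example: a 3-class one-hot target smoothed with alpha = 0.1.
y = np.array([0.0, 1.0, 0.0])
print(smooth_labels(y, alpha=0.1))  # [0.0333... 0.9333... 0.0333...]
```

Under this parameterization, the optimal label smoothing point predicted by the theory corresponds to a single best value of alpha, which in practice would be located by sweeping alpha and measuring validation loss.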
