Does label smoothing mitigate label noise?

Label smoothing is commonly used in training deep learning models, wherein one-hot training labels are mixed with uniform label vectors. Empirically, smoothing has been shown to improve both predictive performance and model calibration. In this paper, we study whether label smoothing is also effective as a means of coping with label noise. While label smoothing might appear to amplify this problem, since it is equivalent to injecting symmetric noise into the labels, we show how it relates to a general family of loss-correction techniques from the label noise literature. Building on this connection, we show that label smoothing is competitive with loss correction under label noise. Further, we show that when distilling models from noisy data, label smoothing of the teacher is beneficial; this contrasts with recent findings for noise-free problems, and sheds further light on the settings in which label smoothing helps.
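
As a concrete illustration of the smoothing operation described above, the following minimal NumPy sketch mixes one-hot labels with the uniform label vector; the names smooth_labels, alpha, and num_classes are illustrative choices for this example, not notation taken from the paper.

    import numpy as np

    def smooth_labels(labels, num_classes, alpha=0.1):
        """Mix one-hot labels with the uniform distribution over classes.

        Each target becomes (1 - alpha) * one_hot(label) + alpha / num_classes,
        i.e. the mixing of one-hot and uniform label vectors described above.
        """
        one_hot = np.eye(num_classes)[labels]              # shape: (n, num_classes)
        uniform = np.full((1, num_classes), 1.0 / num_classes)
        return (1.0 - alpha) * one_hot + alpha * uniform

    # Example: three samples from a 4-class problem.
    targets = smooth_labels(np.array([0, 2, 3]), num_classes=4, alpha=0.1)
    # Each row sums to 1: the true class receives 0.925, every other class 0.025.

With alpha > 0 every class receives nonzero probability mass, which is why smoothing can be read as injecting symmetric label noise, the observation the abstract builds on.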
