Co-teaching: Robust training of deep neural networks with extremely noisy labels

Deep learning with noisy labels is practically challenging, as the capacity of deep models is so high that they can eventually memorize these noisy labels during training. Nonetheless, recent studies on the memorization effects of deep neural networks show that they first memorize training data with clean labels and only later data with noisy labels. Therefore, in this paper we propose a new deep learning paradigm, called Co-teaching, for combating noisy labels. Namely, we train two deep neural networks simultaneously and let them teach each other on every mini-batch: first, each network feeds forward all data and selects some data with possibly clean labels; second, the two networks communicate to each other which data in this mini-batch should be used for training; finally, each network back-propagates on the data selected by its peer network and updates itself. Empirical results on noisy versions of MNIST, CIFAR-10, and CIFAR-100 demonstrate that deep models trained with Co-teaching are substantially more robust than those trained with state-of-the-art methods.
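To make the mini-batch procedure concrete, the sketch below outlines one Co-teaching update in PyTorch. It is a minimal illustration under assumptions not stated in the abstract: two networks `net1` and `net2` with optimizers `opt1` and `opt2`, and a `keep_rate` parameter giving the fraction of small-loss samples each network passes to its peer (in practice this fraction is scheduled over epochs; the names and exact mechanics here are hypothetical).

```python
import torch
import torch.nn.functional as F

def co_teaching_step(net1, net2, opt1, opt2, x, y, keep_rate):
    """One hypothetical Co-teaching update on a single mini-batch (x, y)."""
    # Per-sample cross-entropy losses for both networks (no reduction),
    # computed without gradients since they are only used for ranking.
    with torch.no_grad():
        loss1 = F.cross_entropy(net1(x), y, reduction="none")
        loss2 = F.cross_entropy(net2(x), y, reduction="none")

    num_keep = max(1, int(keep_rate * y.size(0)))

    # Step 1: each network selects its small-loss (likely clean) samples.
    idx1 = torch.argsort(loss1)[:num_keep]  # samples net1 trusts
    idx2 = torch.argsort(loss2)[:num_keep]  # samples net2 trusts

    # Steps 2-3: each network is updated on the samples chosen by its peer.
    opt1.zero_grad()
    F.cross_entropy(net1(x[idx2]), y[idx2]).backward()
    opt1.step()

    opt2.zero_grad()
    F.cross_entropy(net2(x[idx1]), y[idx1]).backward()
    opt2.step()
```

Ranking by per-sample loss operationalizes the memorization observation: early in training, small-loss samples are more likely to carry clean labels, and exchanging the selections keeps each network from reinforcing its own selection bias.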
