MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels

Recent deep networks are capable of memorizing an entire dataset even when the labels are completely random. To overcome overfitting on corrupted labels, we propose a novel technique of learning another neural network, called MentorNet, to supervise the training of the base deep network, namely, the StudentNet. During training, MentorNet provides a curriculum (sample weighting scheme) for the StudentNet so that it focuses on samples whose labels are probably correct. Unlike existing curricula, which are usually predefined by human experts, MentorNet learns a data-driven curriculum dynamically together with the StudentNet. Experimental results demonstrate that our approach can significantly improve the generalization performance of deep networks trained on corrupted training data. Notably, to the best of our knowledge, we achieve the best published result on WebVision, a large benchmark containing 2.2 million images with real-world noisy labels. The code is at this https URL.
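
The abstract describes the mechanism only at a high level. The sketch below is a minimal illustration, not the authors' released implementation: it shows one way a mentor network could map per-sample training signals to weights in [0, 1], and how a student network could then be updated with the weighted loss so that likely-corrupted samples contribute little. The input features (current per-sample loss plus training progress), the tiny MLP, and all names and hyperparameters are assumptions, and the training of MentorNet itself is omitted.

    # Minimal sketch of mentor-weighted student training (assumed design, PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MentorNet(nn.Module):
        # Hypothetical mentor: a small MLP over (per-sample loss, training progress).
        def __init__(self, hidden=16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))

        def forward(self, per_sample_loss, progress):
            # progress is a Python float in [0, 1]; broadcast it to one feature per sample.
            prog = torch.full_like(per_sample_loss, progress)
            feats = torch.stack([per_sample_loss.detach(), prog], dim=1)
            return torch.sigmoid(self.net(feats)).squeeze(1)  # weights in [0, 1]

    def student_step(student, mentor, opt_student, x, y, progress):
        # Weighted-loss update of the student; near-zero weights effectively
        # drop samples whose labels the mentor deems likely corrupted.
        logits = student(x)
        loss_per_sample = F.cross_entropy(logits, y, reduction="none")
        weights = mentor(loss_per_sample, progress)
        weighted_loss = (weights * loss_per_sample).mean()
        opt_student.zero_grad()
        weighted_loss.backward()
        opt_student.step()
        return weighted_loss.item()

In this sketch the mentor is applied with its current parameters; in the paper's setting the curriculum is itself learned, so the mentor's parameters would be updated as well rather than kept fixed.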
