Learning with Noisy Labels

In this paper, we theoretically study the problem of binary classification in the presence of random classification noise: instead of seeing the true labels, the learner sees labels that have each been flipped independently with some small probability. Moreover, the label noise is class-conditional, meaning the flip probability depends on the class. We provide two approaches to suitably modify any given surrogate loss function. First, we construct a simple unbiased estimator of any loss and obtain performance bounds for empirical risk minimization on i.i.d. data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that this method yields an efficient algorithm for empirical risk minimization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with a weighted 0-1 loss, we suggest a simple weighted surrogate loss, for which we obtain strong empirical risk bounds. This approach has a notable consequence: methods already used in practice, such as the biased SVM and weighted logistic regression, are provably noise-tolerant. On a synthetic non-separable dataset, our methods achieve over 88% accuracy even when 40% of the labels are corrupted, and they are competitive with recently proposed methods for handling label noise on several benchmark datasets.
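
As a concrete illustration of the first approach, the following is a minimal sketch of the unbiased loss estimator it is built on: given the class-conditional flip rates, any surrogate loss evaluated on the observed noisy label can be corrected so that its expectation over the label noise equals the loss on the true label. The choice of logistic loss, the function names, and the parameter names rho_pos and rho_neg are our own illustrative assumptions, not code from the paper.

```python
import numpy as np

def logistic_loss(margin):
    # Standard logistic surrogate: ell(t, y) = log(1 + exp(-t * y)),
    # written in terms of the margin t * y.
    return np.log1p(np.exp(-margin))

def unbiased_loss(score, noisy_label, rho_pos, rho_neg):
    """Noise-corrected loss for a label in {-1, +1} observed under
    class-conditional flipping, where rho_pos = P(flip | true label = +1)
    and rho_neg = P(flip | true label = -1).

    In expectation over the random flip, this equals the logistic loss
    evaluated on the true (unobserved) label.
    """
    assert rho_pos + rho_neg < 1.0, "requires rho_pos + rho_neg < 1"
    rho_opposite = rho_neg if noisy_label == 1 else rho_pos  # flip rate of the other class
    rho_observed = rho_pos if noisy_label == 1 else rho_neg  # flip rate of the observed class
    keep = (1.0 - rho_opposite) * logistic_loss(score * noisy_label)
    flip = rho_observed * logistic_loss(-score * noisy_label)
    return (keep - flip) / (1.0 - rho_pos - rho_neg)
```

Minimizing the average of unbiased_loss over the noisy sample then plays the role of ordinary empirical risk minimization on clean data. The logistic loss also satisfies the symmetry condition mentioned above, which in the paper's analysis keeps the corrected loss convex and hence efficiently minimizable.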
