Classification with Noisy Labels by Importance Reweighting

In this paper, we study a classification problem in which sample labels are randomly corrupted. In this scenario, there is an unobservable sample with noise-free labels. However, before being observed, the true labels are independently flipped with a probability ρ ∈ [0, 0.5), and the random label noise can be class-conditional. Here, we address two fundamental problems raised by this scenario. The first is how to best use the abundant surrogate loss functions designed for the traditional classification problem when there is label noise. We prove that any surrogate loss function can be used for classification with noisy labels by using importance reweighting, with a consistency assurance that the label noise does not ultimately hinder the search for the optimal classifier of the noise-free sample. The other is the open problem of how to obtain the noise rate ρ. We show that the rate is upper bounded by the conditional probability P(Ŷ|X) of the noisy sample. Consequently, the rate can be estimated, because the upper bound can be easily reached in classification problems. Experimental results on synthetic and real datasets confirm the efficiency of our methods.
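
The reweighting idea and the noise-rate bound can be made concrete in a short sketch. The Python code below, assuming binary labels in {-1, +1}, class-conditional flip rates ρ₊₁ = P(Ŷ = -1 | Y = +1) and ρ₋₁ = P(Ŷ = +1 | Y = -1), and scikit-learn for posterior estimation, illustrates the two steps: estimating the noise rates from the minima of the noisy class posteriors (the attainable upper bound mentioned above), and training with weights β(x, ŷ) = P(Y = ŷ | x) / P(Ŷ = ŷ | x). The helper names and the choice of logistic regression are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of importance reweighting for class-conditional label noise.
# Assumptions: binary labels in {-1, +1}; logistic regression as the
# probabilistic classifier; names are illustrative, not from the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_noisy_posterior(X, y_noisy):
    """Fit a probabilistic classifier on the noisy sample to estimate P(Ŷ=+1 | x)."""
    clf = LogisticRegression(max_iter=1000).fit(X, y_noisy)
    pos_idx = list(clf.classes_).index(1)
    return clf.predict_proba(X)[:, pos_idx]

def estimate_noise_rates(p_noisy_pos):
    """The noise rates lower-bound the noisy posteriors and the bound is
    (nearly) attained, so the sample minima serve as estimates."""
    rho_minus = p_noisy_pos.min()           # estimate of P(Ŷ=+1 | Y=-1)
    rho_plus = (1.0 - p_noisy_pos).min()    # estimate of P(Ŷ=-1 | Y=+1)
    return rho_plus, rho_minus

def importance_weights(y_noisy, p_noisy_pos, rho_plus, rho_minus):
    """beta(x, ŷ) = P(Y=ŷ | x) / P(Ŷ=ŷ | x), expressed via the noise rates."""
    p_noisy_obs = np.where(y_noisy == 1, p_noisy_pos, 1.0 - p_noisy_pos)
    rho_other = np.where(y_noisy == 1, rho_minus, rho_plus)
    return (p_noisy_obs - rho_other) / ((1.0 - rho_plus - rho_minus) * p_noisy_obs)

def train_reweighted(X, y_noisy):
    """Any surrogate-loss learner that accepts sample weights can be plugged in here."""
    p_noisy_pos = estimate_noisy_posterior(X, y_noisy)
    rho_plus, rho_minus = estimate_noise_rates(p_noisy_pos)
    beta = np.clip(importance_weights(y_noisy, p_noisy_pos, rho_plus, rho_minus), 0.0, None)
    final = LogisticRegression(max_iter=1000)
    final.fit(X, y_noisy, sample_weight=beta)
    return final
```

Because the weights β are nonnegative, minimizing the reweighted empirical surrogate risk on the noisy sample is consistent with minimizing the risk on the clean sample, which is the guarantee stated above.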
