Active cleaning of label noise

Mislabeled examples in the training data can severely degrade the performance of supervised classifiers. In this paper, we present an approach to removing mislabeled examples from a dataset by selecting suspicious examples as targets for inspection. We show that the large-margin and soft-margin principles used in support vector machines (SVMs) tend to capture mislabeled examples as support vectors. Experimental results on two character recognition datasets show that one-class and two-class SVMs capture around 85% and 99% of label noise examples, respectively, as their support vectors. We also propose a new method that iteratively builds two-class SVM classifiers on the non-support-vector examples from the training data, with an expert manually verifying the support vectors, based on their classification score, to identify any mislabeled examples. Experimental results on four datasets show that this method reduces the number of examples to be reviewed and that it is independent of parameter choices. Thus, by (re-)examining the labels of the selected support vectors, most of the noise can be removed. This can be quite advantageous when rapidly building a labeled dataset.

Highlights
• A novel method for removing label noise from data is introduced.
• It significantly reduces the number of examples that must be reviewed.
• The support vectors of an SVM classifier can capture around 99% of label noise examples.
• A two-class SVM captures more label noise examples than a one-class SVM classifier.
• Combining one-class and two-class SVMs produces a marginal improvement.
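The iterative procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn's SVC as the two-class SVM, simulates the expert with a hypothetical `oracle` callback that returns the true label of an example, and stops when a round yields no corrections or after `max_rounds` rounds. The function name `clean_labels`, the stopping rule, the synthetic data, and the 5% noise rate are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC


def clean_labels(X, y, oracle, max_rounds=10, **svc_params):
    """Iteratively send support vectors to an expert (oracle) for review.

    oracle(i) is a hypothetical callback returning the true label of example i.
    Returns the corrected labels and the list of reviewed indices.
    """
    y = np.array(y, copy=True)
    active = np.arange(len(y))  # examples not yet reviewed
    reviewed = []
    for _ in range(max_rounds):
        # A two-class SVM needs both classes among the remaining examples.
        if len(active) == 0 or len(np.unique(y[active])) < 2:
            break
        clf = SVC(**svc_params).fit(X[active], y[active])
        sv = active[clf.support_]  # support vectors, as original indices

        # Order the review by classification score: the signed score is
        # negative when the SVM disagrees with the current label, so the
        # most suspicious examples come first.
        scores = clf.decision_function(X[sv])
        signed = np.where(y[sv] == clf.classes_[1], scores, -scores)
        sv = sv[np.argsort(signed)]

        n_fixed = 0
        for i in sv:
            true_label = oracle(i)
            if true_label != y[i]:
                y[i] = true_label
                n_fixed += 1
        reviewed.extend(sv.tolist())

        # The next round trains only on the non-support-vector examples.
        active = np.setdiff1d(active, sv)
        if n_fixed == 0:  # assumed stopping rule: a round with no corrections
            break
    return y, reviewed


# Synthetic demonstration: two well-separated blobs with 5% flipped labels.
X, y_true = make_blobs(n_samples=200, centers=[[-4, 0], [4, 0]],
                       cluster_std=1.0, random_state=0)
rng = np.random.default_rng(0)
flipped = rng.choice(len(y_true), size=10, replace=False)
y_noisy = y_true.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

y_clean, reviewed = clean_labels(X, y_noisy, oracle=lambda i: y_true[i],
                                 kernel="linear", C=1.0)
print("noisy labels before:", int((y_noisy != y_true).sum()))
print("noisy labels after: ", int((y_clean != y_true).sum()))
print("examples reviewed:  ", len(reviewed), "of", len(y_true))
```

In this sketch the expert looks only at the support vectors of each round, so the review effort is the length of `reviewed` rather than the size of the whole training set. The kernel and `C` values are passed straight through to SVC and are illustrative only.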
