Novel mislabeled training data detection algorithm

Mislabeled training data are a common form of noise in many applications. Because of their negative effect on learning, many filtering techniques have been proposed to identify and eliminate them. The ensemble learning-based filter (EnFilter), which employs an ensemble of classifiers, is the most widely used. EnFilter first partitions the noisy training dataset into several subsets; each subset is then checked by multiple classifiers trained on the remaining (still noisy) subsets. Because the data used to train these classifiers are themselves noisy, the quality of the classifiers cannot be guaranteed, which may lead to poor noise identification; the problem worsens as the noise ratio in the training set grows. To address this, we propose a straightforward but effective approach: instead of training the voting classifiers on noisy data, we train them on nearly noise-free (NNF) data, which are expected to yield more reliable classifiers. To this end, we also propose a novel NNF data extraction approach. Experimental results on a set of benchmark datasets illustrate the utility of the proposed approach.
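
To make the EnFilter baseline concrete, the sketch below shows a standard ensemble-based majority-vote filter of the kind described above, written with scikit-learn. The classifier pool, the fold count, and the `majority_filter` helper are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def majority_filter(X, y, n_splits=5, random_state=0):
    """Flag samples whose label is rejected by a majority of
    classifiers trained on the *other* (noisy) folds.
    Returns a boolean mask: True = suspected mislabeled."""
    # Illustrative classifier pool; the paper's exact ensemble may differ.
    base_learners = [
        DecisionTreeClassifier(random_state=random_state),
        GaussianNB(),
        KNeighborsClassifier(n_neighbors=3),
    ]
    suspected = np.zeros(len(y), dtype=bool)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, check_idx in kf.split(X):
        votes = np.zeros(len(check_idx), dtype=int)
        for clf in base_learners:
            # Voters are fitted on the other folds, which still contain noise;
            # this is exactly the weakness the NNF approach targets.
            clf.fit(X[train_idx], y[train_idx])
            votes += (clf.predict(X[check_idx]) != y[check_idx]).astype(int)
        # Majority vote: more than half of the classifiers reject the label.
        suspected[check_idx] = votes > len(base_learners) / 2
    return suspected
```

Under the paper's observation, replacing the noisy training folds in `clf.fit(...)` with an extracted nearly noise-free subset should yield more reliable voters, especially at high noise ratios. A consensus variant (flagging a sample only when all classifiers reject its label) is a more conservative alternative from the same family of filters.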
