New label noise injection methods for the evaluation of noise filters

Abstract Noise is often present in the real datasets used for training Machine Learning classifiers. Its disruptive effects on the learning process may include an increase in the complexity of the induced models, longer processing times and reduced predictive power when classifying new examples. Treating noisy data in a preprocessing step is therefore crucial for improving data quality and reducing its harmful effects on learning. Various filters, based on different concepts, exist for identifying noisy examples in a dataset. Their preprocessing ability is usually assessed by how well they identify artificial noise injected into one or more datasets, which overcomes the limitation that only a domain expert can guarantee whether a real example is indeed noisy. The most frequently used label noise injection method is noise at random, in which a percentage of the training examples have their labels randomly exchanged, regardless of the characteristics and positions of the selected examples in the example space. This paper proposes two novel methods to inject label noise into classification datasets. Based on complexity measures, these methods can produce more challenging and realistic noisy datasets by disturbing the labels of critical examples situated close to the decision borders, thereby improving the evaluation of noise filters. An extensive experimental evaluation of different noise filters is performed on public datasets with injected label noise, and the influence of the noise injection methods is compared in both the data preprocessing and classification steps.
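To make the contrast between the two injection strategies concrete, the following is a minimal sketch of a random label-noise injector and a borderline-oriented one. It assumes labels are held in a NumPy array with at least two classes, and it approximates closeness to the decision border by the distance to the nearest "enemy" (the nearest example of another class), one simple complexity-based criterion; the paper's actual complexity measures may differ, and all function and parameter names here are illustrative.

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def flip_labels(y, idx, rng):
    """Exchange the label of each selected example for a different,
    randomly chosen class."""
    y_noisy = y.copy()
    classes = np.unique(y)
    for i in idx:
        y_noisy[i] = rng.choice(classes[classes != y[i]])
    return y_noisy


def inject_noise_at_random(y, rate, seed=None):
    """Noise at random: corrupt a fraction `rate` of the labels,
    regardless of where the examples lie in the example space."""
    rng = np.random.default_rng(seed)
    n_flip = int(rate * len(y))
    idx = rng.choice(len(y), size=n_flip, replace=False)
    return flip_labels(y, idx, rng)


def inject_borderline_noise(X, y, rate, seed=None):
    """Borderline-oriented injection (illustrative proxy for the
    paper's complexity-based methods): corrupt the labels of the
    examples closest to the decision border, approximated here by
    the distance to the nearest example of another class."""
    rng = np.random.default_rng(seed)
    d = pairwise_distances(X)
    # Distance from each example to its nearest enemy; small values
    # indicate examples close to the class border.
    enemy_dist = np.array([d[i, y != y[i]].min() for i in range(len(y))])
    n_flip = int(rate * len(y))
    idx = np.argsort(enemy_dist)[:n_flip]
    return flip_labels(y, idx, rng)
```

With the same `rate`, both injectors corrupt an equal number of labels, but the borderline variant concentrates the corruption on examples nearest to instances of other classes, yielding the more challenging and realistic noise scenarios that the proposed methods target.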
