Dealing with Noisy Data

This chapter focuses on noise as an imperfection of the data. Noise in data is a common problem with several negative consequences for classification. It is largely unavoidable, because errors commonly occur during the data collection and data preparation processes of Data Mining applications. The performance of models built under such circumstances depends heavily on the quality of the training data, but also on the robustness of the learner itself against noise. Hence, noisy problems are complex, and accurate solutions are often difficult to achieve without specialized techniques, particularly when the learner is noise-sensitive. Identifying the noise is a complex task that is developed in Sect. 5.1. The different types of noise are then described in Sect. 5.2. From this point on, the two main approaches followed in the literature are described: modifying and cleaning the data is studied in Sect. 5.3, whereas designing noise-robust Machine Learning algorithms is tackled in Sect. 5.4. An empirical comparison of the latest approaches in the specialized literature is made in Sect. 5.5.
