Class Noise vs. Attribute Noise: A Quantitative Study

Real-world data is never perfect and often suffers from corruptions (noise) that may impact interpretations of the data, models built from the data, and decisions made based on the data. Noise can degrade system performance in terms of classification accuracy, the time required to build a classifier, and the size of the classifier. Accordingly, most existing learning algorithms integrate various mechanisms to enhance their ability to learn from noisy environments, yet noise can still have a serious negative impact. A more reasonable solution is to employ preprocessing mechanisms that handle noisy instances before a learner is trained. Unfortunately, little research has systematically explored the impact of noise, especially from the noise-handling point of view. This has left various noise handling techniques less effective than they could be, particularly when dealing with noise introduced into attributes. In this paper, we present a systematic evaluation of the effect of noise in machine learning. Rather than adopting a unified theory of noise, we differentiate noise into two categories, class noise and attribute noise, and analyze their impacts on system performance separately. Because class noise has been widely addressed in existing research, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise in different attributes, and possible solutions for handling attribute noise. Our conclusions can guide interested readers in enhancing data quality through the design of appropriate noise handling mechanisms.
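The sketch below illustrates the kind of comparison the abstract describes: injecting class noise (flipped labels) versus attribute noise (corrupted feature values) into the training data at increasing rates, then measuring test accuracy of a classifier trained on each corrupted copy. This is a minimal, assumed setup for illustration only: it uses scikit-learn's decision tree and a synthetic dataset rather than the C4.5 learner and UCI benchmarks used in the paper, and the noise-injection helpers (`add_class_noise`, `add_attribute_noise`) are hypothetical.

```python
# Illustrative sketch only: scikit-learn decision tree on synthetic data,
# not the paper's original experimental setup (C4.5 on UCI datasets).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def add_class_noise(y, rate):
    """Flip the labels of a random fraction `rate` of instances (binary labels assumed)."""
    y = y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    y[idx] = 1 - y[idx]
    return y

def add_attribute_noise(X, rate):
    """Replace a random fraction `rate` of each attribute's values with
    values drawn uniformly from that attribute's observed range."""
    X = X.copy()
    n = len(X)
    for j in range(X.shape[1]):
        idx = rng.choice(n, size=int(rate * n), replace=False)
        X[idx, j] = rng.uniform(X[:, j].min(), X[:, j].max(), size=len(idx))
    return X

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for rate in (0.0, 0.1, 0.2, 0.4):
    # Noise is injected into the training data only; the test set stays clean.
    acc_class = accuracy_score(
        y_te,
        DecisionTreeClassifier(random_state=0)
        .fit(X_tr, add_class_noise(y_tr, rate))
        .predict(X_te))
    acc_attr = accuracy_score(
        y_te,
        DecisionTreeClassifier(random_state=0)
        .fit(add_attribute_noise(X_tr, rate), y_tr)
        .predict(X_te))
    print(f"noise={rate:.0%}  class-noise acc={acc_class:.3f}  "
          f"attribute-noise acc={acc_attr:.3f}")
```

Running such a comparison across noise levels is one way to reproduce, in spirit, the accuracy-versus-noise curves that motivate treating class noise and attribute noise separately.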
