Identifying Mislabeled Training Data

This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve the classification accuracy of learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single-algorithm, majority-vote, and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30%. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative: they retain some mislabeled data in order to avoid discarding good data, whereas majority-vote filters detect more mislabeled data at the cost of discarding some good data. This suggests that consensus filters are preferable when data are scarce, whereas majority-vote filters are preferable when data are abundant.
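
A minimal sketch of the filtering idea is given below, assuming scikit-learn base learners; the particular filter algorithms, the ten-fold split, and the function name filter_mislabeled are illustrative assumptions, not the exact configuration evaluated in the paper. Each training instance is judged by filter classifiers trained on the other cross-validation folds; a majority filter flags the instance when more than half of the filters misclassify it, while a consensus filter flags it only when all of them do.

```python
# Hypothetical sketch of cross-validated majority-vote / consensus noise
# filtering; base learners, fold count, and names are illustrative, not the
# authors' exact implementation.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB


def filter_mislabeled(X, y, scheme="majority", n_splits=10, random_state=0):
    """Return a boolean mask of training instances to KEEP.

    scheme='majority': flag an instance if more than half of the filter
    classifiers misclassify it; scheme='consensus': flag it only if all do.
    X and y are assumed to be NumPy arrays.
    """
    filters = [
        DecisionTreeClassifier(random_state=random_state),
        KNeighborsClassifier(n_neighbors=1),
        GaussianNB(),
    ]
    errors = np.zeros((len(y), len(filters)), dtype=bool)

    # Each instance is classified by filters trained on the other folds,
    # so no filter ever votes on data it was trained on.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, test_idx in kf.split(X):
        for j, clf in enumerate(filters):
            clf.fit(X[train_idx], y[train_idx])
            errors[test_idx, j] = clf.predict(X[test_idx]) != y[test_idx]

    votes = errors.sum(axis=1)
    if scheme == "majority":
        flagged = votes > len(filters) / 2
    elif scheme == "consensus":
        flagged = votes == len(filters)
    else:
        raise ValueError("scheme must be 'majority' or 'consensus'")
    return ~flagged


# Usage sketch:
#   keep = filter_mislabeled(X_train, y_train, scheme="consensus")
#   X_clean, y_clean = X_train[keep], y_train[keep]
```

Requiring agreement from every filter is the stricter condition for discarding an instance, which is why the consensus scheme tends to retain questionable data while the majority scheme discards more of it, matching the trade-off described above.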
