An algorithm for correcting mislabeled data

Reliable evaluation of classifier performance depends on the quality of the data sets on which the classifiers are tested. During the collection and recording of a data set, however, noise may be introduced into the data, especially in real-world environments, degrading the quality of the data set. In this paper, we present a novel approach, called ADE (automatic data enhancement), for correcting mislabeled data in a data set. ADE uses multi-layer neural networks trained by backpropagation as its basic framework and assigns each training pattern a class probability vector as its class label, in which each component represents the probability of the corresponding class. During training, ADE continually updates the probability vector based on its difference from the output of the network. Under this updating rule, the probability of a mislabeled class gradually shrinks while that of the correct class grows, which eventually corrects the mislabeled data after a number of training epochs. We have tested ADE with nearest neighbor classifiers on a number of data sets drawn from the UCI machine learning repository. The results show that for most data sets, when mislabeled data are present, a classifier constructed from a training set corrected by ADE achieves significantly higher accuracy than one constructed without ADE.
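The abstract does not spell out the exact form of ADE's label update, but the mechanism it describes can be illustrated with a short sketch: a small backpropagation-trained network whose soft class-probability labels are nudged toward the network's output each epoch, so labels that the network persistently contradicts drift to the predicted class. The convex-combination update (1 - alpha) * p + alpha * o, the rate alpha, and all data and network details below are illustrative assumptions, not the paper's actual algorithm.

```python
# Hypothetical sketch of ADE-style soft-label correction. The exact ADE
# update rule is not given in the abstract; here we ASSUME a simple
# convex-combination update of each label vector toward the network output.
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-class data with some deliberately flipped labels.
n, d, k = 200, 2, 2
X = rng.normal(size=(n, d))
true_y = (X[:, 0] + X[:, 1] > 0).astype(int)
noisy_y = true_y.copy()
flip = rng.choice(n, size=20, replace=False)   # mislabel 10% of the patterns
noisy_y[flip] = 1 - noisy_y[flip]

# Class-probability vectors used as (soft) labels, one per training pattern.
P = np.eye(k)[noisy_y]

# Tiny one-hidden-layer network trained by backpropagation.
h = 16
W1 = rng.normal(scale=0.5, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.5, size=(h, k)); b2 = np.zeros(k)

def forward(X):
    Z = np.tanh(X @ W1 + b1)
    logits = Z @ W2 + b2
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    O = np.exp(logits)
    return Z, O / O.sum(axis=1, keepdims=True)

lr, alpha = 0.5, 0.05   # alpha: assumed label-update rate
for epoch in range(200):
    Z, O = forward(X)
    # Backprop for softmax + cross-entropy against the current soft labels.
    G = (O - P) / n
    gW2 = Z.T @ G; gb2 = G.sum(0)
    GZ = (G @ W2.T) * (1 - Z**2)                  # tanh derivative
    gW1 = X.T @ GZ; gb1 = GZ.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
    # ADE-style step (assumed form): move each label vector toward the
    # network's output; rows of P stay valid probability vectors.
    P = (1 - alpha) * P + alpha * O

corrected = P.argmax(1)
print("flipped labels corrected:", (corrected[flip] == true_y[flip]).sum(),
      "of", len(flip))
```

In this toy run the hidden structure of the data dominates the 10% label noise, so the soft labels of mislabeled points gradually cross over to the correct class, mirroring the behavior the abstract describes; the corrected hard labels are then read off with argmax.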
