Class Noise Mitigation Through Instance Weighting

We describe a novel framework for class noise mitigation that assigns a vector of class membership probabilities to each training instance and uses the confidence in the instance's current label as its weight during training. The probability vector should be calculated so that a clean instance has high confidence in its current label, while a mislabeled instance has low confidence in its current label and high confidence in its correct label. Past research has focused on techniques that either discard or correct suspect instances. We argue that discarding and correcting are special cases of instance weighting and are therefore part of this framework. We propose a method that uses clustering to calculate a probability distribution over the class labels for each instance. We demonstrate that our method improves classifier accuracy relative to training on the original training set, and that instance weighting can outperform discarding.
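
The weighting idea can be illustrated with a minimal Python sketch. It is not the paper's algorithm: it assumes cluster assignments are already available from some clustering step, and it estimates each instance's class probabilities from the label mix of its cluster, which is one plausible reading of "uses clustering". All function names (`class_probabilities`, `instance_weights`, `discard_weights`, `corrected_labels`) and the 0.5 threshold are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def class_probabilities(cluster_ids, labels):
    """Give each instance the label distribution of its cluster (assumed estimator)."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts[cluster][label] += 1
    probs = []
    for cluster in cluster_ids:
        total = sum(counts[cluster].values())
        probs.append({k: counts[cluster][k] / total for k in classes})
    return probs


def instance_weights(probs, labels):
    """Weight of an instance = confidence in its current label."""
    return [p[label] for p, label in zip(probs, labels)]


def discard_weights(weights, threshold=0.5):
    """Discarding as a special case: weights collapse to 0 or 1 (threshold is an assumption)."""
    return [1.0 if w >= threshold else 0.0 for w in weights]


def corrected_labels(probs):
    """Correcting as a special case: relabel each instance with its most probable class."""
    return [max(p, key=p.get) for p in probs]


# Toy data: two clusters, each containing one instance whose label disagrees with the majority.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
labels = ["a", "a", "a", "b", "b", "b", "b", "a"]

probs = class_probabilities(cluster_ids, labels)
weights = instance_weights(probs, labels)
```

In this toy data, the two instances whose labels disagree with their cluster's majority receive weight 0.25 and the rest receive 0.75. Thresholding the weights reproduces discarding, and taking the most probable class reproduces correcting, which is the sense in which both can be viewed as special cases of weighting. The resulting weights could then be passed to any learner that accepts per-instance weights.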
