Instance-based data reduction for improved identification of difficult small classes

We studied three methods for improving the identification of small classes that are also difficult to classify, each of which balances an imbalanced class distribution through data reduction. The new method, the neighborhood cleaning (NCL) rule, outperformed simple random sampling within classes and the one-sided selection method in experiments with ten real-world data sets. All reduction methods clearly improved the identification of small classes, raising the true-positive rates of the three-nearest neighbor method and the C4.5 decision tree generator (20-30%), but the differences between the methods were insignificant. However, the significant differences in accuracies, true-positive rates, and true-negative rates obtained from the reduced data were in favor of our method. The results suggest that the NCL rule is a useful method for improving the modeling of difficult small classes, as well as for building classifiers that identify these classes from real-world data, which frequently have an imbalanced class distribution.
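As a rough illustration of the simplest of the compared approaches, the sketch below shows random sampling within classes together with a per-class true-positive rate. It is not the authors' implementation: the function names, the `target_size` parameter, and the choice to cut each large class down to the size of the small class are assumptions made for this example, and the NCL rule and one-sided selection are not reproduced here.

```python
import random
from collections import defaultdict


def sample_within_classes(labels, target_size, seed=0):
    """Return sorted indices kept after randomly reducing each class
    that has more than target_size examples (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    kept = []
    for indices in by_class.values():
        if len(indices) > target_size:
            # Sample without replacement inside the oversized class.
            indices = rng.sample(indices, target_size)
        kept.extend(indices)
    return sorted(kept)


def true_positive_rate(y_true, y_pred, cls):
    """Fraction of the examples of class cls that were predicted as cls."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == cls]
    if not predictions:
        raise ValueError(f"no examples of class {cls!r}")
    return sum(p == cls for p in predictions) / len(predictions)


# Toy data (made up for illustration): 90 large-class and 10 small-class examples.
labels = ["large"] * 90 + ["small"] * 10
kept = sample_within_classes(labels, target_size=10)
reduced = [labels[i] for i in kept]
```

On the reduced data the two classes are the same size, so a classifier trained on it is no longer dominated by the large class. The true-positive rate is computed per class, which is why it can expose poor performance on a small class that overall accuracy hides.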
