On the k-NN performance in a challenging scenario of imbalance and overlapping

A two-class data set is said to be imbalanced when one (minority) class is heavily under-represented with respect to the other (majority) class. In the presence of a significant overlapping, the task of learning from imbalanced data can be a very difficult problem. Additionally, if the overall imbalance ratio is different from local imbalance ratios in overlap regions, the task can become in a major challenge. This paper explains the behaviour of the k-nearest neighbour (k-NN) rule when learning from such a complex scenario. This local model is compared to other machine learning algorithms, attending to how their behaviour depends on a number of data complexity features (global imbalance, size of overlap region, and its local imbalance). As a result, several conclusions useful for classifier design are inferred.

[1]  Thomas M. Cover,et al.  Estimation by the nearest neighbor rule , 1968, IEEE Trans. Inf. Theory.

[2]  Tom Fawcett,et al.  ROC graphs with instance-varying costs , 2006, Pattern Recognit. Lett..

[3]  Roderick J. A. Little,et al.  Statistical Analysis with Missing Data: Little/Statistical Analysis with Missing Data , 2002 .

[4]  D. Rubin,et al.  Statistical Analysis with Missing Data , 1988 .

[5]  Robert P. W. Duin,et al.  Precision-recall operating characteristic (P-ROC) curves in imprecise environments , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[6]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[7]  Ester Bernadó-Mansilla,et al.  The class imbalance problem in learning classifier systems: a preliminary study , 2005, GECCO '05.

[8]  Tom Fawcett,et al.  Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions , 1997, KDD.

[9]  David J. Hand,et al.  Choosing k for two-class nearest neighbour classifiers with unbalanced classes , 2003, Pattern Recognit. Lett..

[10]  Stan Matwin,et al.  Addressing the Curse of Imbalanced Training Sets: One-Sided Selection , 1997, ICML.

[11]  Nikolaos M. Avouris,et al.  EVALUATION OF CLASSIFIERS FOR AN UNEVEN CLASS DISTRIBUTION PROBLEM , 2006, Appl. Artif. Intell..

[12]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[13]  Michael J. Pazzani,et al.  Reducing Misclassification Costs , 1994, ICML.

[14]  Jerome H. Friedman,et al.  On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality , 2004, Data Mining and Knowledge Discovery.

[15]  Nobuhiro Yugami,et al.  Effects of domain characteristics on instance-based learning algorithms , 2003, Theor. Comput. Sci..

[16]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[17]  Charles X. Ling,et al.  Using AUC and accuracy in evaluating learning algorithms , 2005, IEEE Transactions on Knowledge and Data Engineering.

[18]  Tom Fawcett,et al.  Adaptive Fraud Detection , 1997, Data Mining and Knowledge Discovery.

[19]  Belur V. Dasarathy,et al.  Nearest neighbor (NN) norms: NN pattern classification techniques , 1991 .

[20]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[21]  Haym Hirsh,et al.  The effect of small disjuncts and class distribution on decision tree learning , 2003 .

[22]  Gustavo E. A. P. A. Batista,et al.  Class Imbalances versus Class Overlapping: An Analysis of a Learning System Behavior , 2004, MICAI.

[23]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[24]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques, 3rd Edition , 1999 .

[25]  Josef Kittler,et al.  Pattern recognition : a statistical approach , 1982 .

[26]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.

[27]  R. Barandelaa,et al.  Strategies for learning in class imbalance problems , 2003, Pattern Recognit..

[28]  Martin D. Buhmann,et al.  Radial Basis Functions: Theory and Implementations: Preface , 2003 .

[29]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[30]  Roberto Alejo,et al.  Analysis of new techniques to obtain quality training sets , 2003, Pattern Recognit. Lett..

[31]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[32]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[33]  Donald Perlis,et al.  Explicitly biased generalization , 1989, Comput. Intell..