Paper information - On the k-NN performance in a challenging scenario of imbalance and overlapping

A two-class data set is said to be imbalanced when one class (the minority) is heavily under-represented with respect to the other (the majority). When the classes also overlap significantly, learning from imbalanced data can be very difficult. If, in addition, the overall imbalance ratio differs from the local imbalance ratios in the overlap regions, the task can become a major challenge. This paper explains the behaviour of the k-nearest neighbour (k-NN) rule when learning in such a complex scenario. This local model is compared with other machine learning algorithms, with attention to how their behaviour depends on several data complexity features (global imbalance, size of the overlap region, and its local imbalance). As a result, several conclusions useful for classifier design are drawn.
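To make the three complexity features concrete, the following is a minimal sketch, not the paper's experimental setup. It builds a one-dimensional toy data set in which the global imbalance ratio differs from the local ratio inside the overlap region, and applies a plain majority-vote k-NN rule. The function names (`imbalance_ratio`, `knn_predict`), the toy data, and the overlap interval [9, 12] are all assumptions made for illustration.

```python
from collections import Counter

def imbalance_ratio(labels, majority=0, minority=1):
    """Ratio of majority to minority examples (illustrative definition)."""
    counts = Counter(labels)
    return counts[majority] / counts[minority]

def knn_predict(train, query, k):
    """Plain k-NN rule on 1-D points: majority vote among the k closest."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 12 majority points (label 0) at x = 0..11 and
# 3 minority points (label 1) placed inside the right-hand end of that range.
train = [(float(x), 0) for x in range(12)]
train += [(9.5, 1), (10.5, 1), (11.5, 1)]

# Global imbalance: computed over the whole training set.
global_ir = imbalance_ratio([label for _, label in train])

# Local imbalance: computed only over the assumed overlap region [9, 12].
overlap = [(x, label) for x, label in train if 9.0 <= x <= 12.0]
local_ir = imbalance_ratio([label for _, label in overlap])

print("global imbalance ratio:", global_ir)
print("local imbalance ratio in overlap:", local_ir)

# The k-NN decision at a query inside the overlap depends only on the
# neighbourhood, so it can change with k even though global_ir is fixed.
for k in (1, 3, 5):
    print("k =", k, "->", knn_predict(train, 10.4, k))
```

Because the k-NN rule only looks at the neighbourhood of the query, its decision inside the overlap region is driven by the local class proportions there, not by the global ratio. Odd values of k are used so that two-class votes cannot tie.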
