Paper information - Learning from Imbalanced Data Sets: A Comparison of Various Strategies

Although most previously designed concept-learning systems assume that their training sets are well balanced, this assumption does not necessarily hold. Indeed, there exist many domains in which one class is represented by a large number of examples while the other is represented by only a few. The purpose of this paper is (1) to demonstrate experimentally that, at least in the case of connectionist systems, class imbalances hinder the performance of standard classifiers, and (2) to compare the performance of several approaches previously proposed to deal with the problem.

N. Japkowicz

[1] Geoffrey E. Hinton, et al. Learning internal representations by error propagation, 1986.

[2] D. Wolpert. On Overfitting Avoidance as Bias, 1993.

[3] Michael J. Pazzani, et al. Reducing Misclassification Costs, 1994, ICML.

[4] Nathalie Japkowicz, et al. A Novelty Detection Approach to Classification, 1995, IJCAI.

[5] Stan Matwin, et al. Addressing the Curse of Imbalanced Training Sets: One-Sided Selection, 1997, ICML.

[6] Charles X. Ling, et al. Data Mining for Direct Marketing: Problems and Solutions, 1998, KDD.

[7] Nathalie Japkowicz, et al. The Class Imbalance Problem: Significance and Strategies, 2000.