Study on Unbalanced Binary Classification with Unknown Misclassification Costs

With the rapid development of big data and machine learning technologies, many fields have adopted related algorithms and methods. Classification algorithms are widely used in financial risk identification, fault diagnosis, medical diagnosis, and similar applications. In these settings, however, the datasets are often unbalanced, and standard classification methods often fail to classify instances correctly. Many methods, such as over-sampling, under-sampling, and ensemble methods, have been proposed to improve classifier performance, but which one to choose for a given dataset remains an open problem. This paper therefore aims at an experimental conclusion about which kind of method generally performs best on unbalanced classification problems. Specifically, we evaluate the performance of 13 methods for unbalanced classification on several unbalanced datasets that differ in the number of instances and in the ratio of positive instances, and draw a conclusion from the results.
