A new sampling approach for classification of imbalanced data sets with high density

Class imbalance is a common problem in machine learning. Because traditional classification algorithms are designed for balanced data, they tend to perform poorly on imbalanced classification tasks, especially when the imbalanced data are highly dense. This paper first discusses the importance of imbalanced data classification across various fields; it then reviews existing methods for addressing the class imbalance problem; finally, it proposes two new sampling methods based on Borderline-SMOTE for dense imbalanced data, particularly big data with this kind of distribution. The two new algorithms not only over-sample minority samples near the borderline, but also create appropriate synthetic samples on the majority-class side and under-sample certain majority-class samples. Experiments show that, when the sampling rate brings the majority and minority classes close to equilibrium, the two algorithms outperform random over-sampling, SMOTE (Synthetic Minority Over-sampling Technique), and Borderline-SMOTE under the AUC (area under the Receiver Operating Characteristic curve) metric.
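As context for the sampling step described above, the following is a minimal sketch of the Borderline-SMOTE idea that the proposed methods build on: minority samples whose neighbourhoods are majority-dominated (but not entirely majority) are treated as "danger" points and over-sampled by interpolation toward minority neighbours. The function name, parameters, and the simple brute-force neighbour search are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np


def borderline_smote(X, y, minority=1, k=5, n_synthetic=None, rng=None):
    """Simplified Borderline-SMOTE sketch (illustrative, not the paper's code).

    Only "danger" minority points are over-sampled: those whose k nearest
    neighbours are majority-dominated but not entirely majority (points with
    all-majority neighbourhoods are treated as noise and skipped).
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    X_min = X[y == minority]

    # 1. Find borderline ("danger") minority samples.
    danger = []
    for x in X_min:
        d = np.linalg.norm(X - x, axis=1)
        nn = np.argsort(d)[1:k + 1]          # skip the point itself
        n_maj = np.sum(y[nn] != minority)
        if k / 2 <= n_maj < k:               # majority-dominated, not pure noise
            danger.append(x)
    danger = np.asarray(danger)
    if len(danger) == 0:
        return X, y

    # 2. Interpolate each synthetic sample between a danger point and one of
    #    its nearest minority-class neighbours.
    n_synthetic = n_synthetic or len(danger)
    synth = []
    for _ in range(n_synthetic):
        x = danger[rng.integers(len(danger))]
        d = np.linalg.norm(X_min - x, axis=1)
        nn = np.argsort(d)[1:k + 1]
        neighbour = X_min[nn[rng.integers(len(nn))]]
        synth.append(x + rng.random() * (neighbour - x))

    X_new = np.vstack([X, synth])
    y_new = np.concatenate([y, np.full(len(synth), minority)])
    return X_new, y_new
```

The paper's two proposed variants extend this scheme by additionally placing some synthetic samples on the majority-class side of the border and under-sampling selected majority samples; those steps are not reproduced here.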
