CLUS: A new hybrid sampling classification for imbalanced data

This paper proposes CLUS (CLUSter-based hybrid sampling), a new hybrid sampling approach for improving classifier performance on two-class imbalanced datasets. The objective is to develop an algorithm that can effectively classify two-class imbalanced datasets with complicated distributions and large overlap between classes, problems that can cause learners to fail. The contribution of CLUS is therefore to alleviate the overlap between classes and to balance the class distribution. First, all instances are partitioned into k clusters using the k-means algorithm. Second, CLUS creates new subsets, each consisting of instances from different classes that have different characteristics. Third, an oversampling method is applied to each subset. Finally, SVMs are trained on the resulting training sets and their predictions are combined by majority vote. CLUS is tested on eight imbalanced benchmark datasets and assessed with two metrics, F-measure and AUC. The experimental results show that CLUS outperforms other methods, especially when the imbalance ratio is high.
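
The four steps in the abstract (cluster, build subsets, oversample, vote with SVMs) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how subsets are formed or which oversampling method is used, so the sketch assumes each subset pairs the majority instances of one cluster with all minority instances, and it substitutes plain random oversampling with replacement. The function names, the choice k=3, the default RBF kernel, and the convention that label 1 is the minority class are all hypothetical. It relies on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC


def oversample_to_balance(idx_a, idx_b, rng):
    """Randomly duplicate indices of the smaller group until both groups match in size."""
    small, large = sorted((idx_a, idx_b), key=len)
    extra = rng.choice(small, size=len(large) - len(small), replace=True)
    return np.concatenate([large, small, extra])


def fit_cluster_ensemble(X, y, k=3, seed=0):
    """Train one SVM per cluster-derived, rebalanced subset (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 1: partition all instances into k clusters with k-means.
    clusters = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    minority = np.flatnonzero(y == 1)  # assumption: label 1 is the minority class
    models = []
    for c in range(k):
        # Step 2 (assumed rule): majority instances of cluster c plus all minority instances.
        majority = np.flatnonzero((clusters == c) & (y == 0))
        if majority.size == 0:
            continue  # no majority instances fell into this cluster
        # Step 3: random oversampling (a stand-in for the paper's oversampling method).
        idx = oversample_to_balance(majority, minority, rng)
        # Step 4a: train one SVM on the balanced subset.
        models.append(SVC(kernel="rbf").fit(X[idx], y[idx]))
    return models


def predict_majority_vote(models, X):
    """Step 4b: combine the SVM predictions by majority vote (ties go to class 1)."""
    votes = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic two-class data with a 10:1 imbalance ratio, for demonstration only.
demo_rng = np.random.default_rng(0)
X_demo = np.vstack([demo_rng.normal(0.0, 1.0, size=(200, 2)),
                    demo_rng.normal(3.0, 1.0, size=(20, 2))])
y_demo = np.array([0] * 200 + [1] * 20)
demo_models = fit_cluster_ensemble(X_demo, y_demo, k=3)
demo_pred = predict_majority_vote(demo_models, X_demo)
```

Training each SVM on a single cluster's majority instances keeps every subset small and locally balanced, which is one plausible reading of how clustering could reduce class overlap; the paper's actual subset construction may differ.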
