A Novel Ensemble Framework Based on K-Means and Resampling for Imbalanced Data

Imbalanced classification is one of the most important problems in machine learning and data mining, and it arises in many real-world datasets. Basic classifiers such as SVM and KNN have long been applied to imbalanced datasets, in which one class contains far more samples than another, but their classification performance is not ideal. Data preprocessing methods have been proposed to reduce the imbalance ratio of a dataset and are combined with basic classifiers to obtain better performance. To improve overall classification accuracy, we propose a novel classifier ensemble framework based on K-means and resampling (EKR). First, we divide the majority-class samples into several sub-clusters using K-means, with the value of k determined by the average silhouette coefficient. Second, we resample each sub-cluster so that it contains the same number of samples as the minority class. Third, we combine each adjusted sub-cluster with the minority class to form a balanced subset, train a base classifier on each balanced subset separately, and integrate the base classifiers into a strong ensemble classifier. Extensive experiments on 16 imbalanced datasets demonstrate the effectiveness and feasibility of the proposed algorithm under multiple evaluation criteria, and EKR achieves better performance than several classical imbalanced classification algorithms that use different data preprocessing methods.
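The pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`fit_ekr`, `predict_ekr`, and the helpers), the 3-nearest-neighbour base classifier, the candidate values of k, and the voting rule (simple majority, ties going to the minority class) are all assumptions chosen for brevity, since the abstract does not specify them. Label 0 denotes the majority class and label 1 the minority class.

```python
import random
from math import dist

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, k, rng, iters=25):
    """Plain Lloyd's k-means; returns the non-empty clusters."""
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def avg_silhouette(clusters):
    """Average silhouette coefficient over all points (-1.0 for a single cluster)."""
    if len(clusters) < 2:
        return -1.0
    scores = []
    for i, c in enumerate(clusters):
        others = [o for j, o in enumerate(clusters) if j != i]
        for p in c:
            if len(c) == 1:
                scores.append(0.0)  # convention for singleton clusters
                continue
            a = sum(dist(p, q) for q in c) / (len(c) - 1)
            b = min(sum(dist(p, q) for q in o) / len(o) for o in others)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def resample(cluster, n, rng):
    """Undersample without replacement, or oversample with replacement, to size n."""
    if len(cluster) >= n:
        return rng.sample(cluster, n)
    return cluster + rng.choices(cluster, k=n - len(cluster))

def fit_ekr(majority, minority, k_candidates=(2, 3, 4), seed=0):
    """Build one balanced training subset per majority sub-cluster."""
    rng = random.Random(seed)
    # Pick the clustering whose k gives the highest average silhouette.
    clusters = max((kmeans(majority, k, rng) for k in k_candidates),
                   key=avg_silhouette)
    subsets = []
    for cluster in clusters:
        balanced = resample(cluster, len(minority), rng)
        subsets.append([(p, 0) for p in balanced] + [(p, 1) for p in minority])
    return subsets

def knn_predict(train, x, n_neighbors=3):
    """Assumed base classifier: majority label among the nearest neighbours."""
    nearest = sorted(train, key=lambda pl: dist(pl[0], x))[:n_neighbors]
    return 1 if 2 * sum(label for _, label in nearest) > n_neighbors else 0

def predict_ekr(subsets, x, n_neighbors=3):
    """Majority vote over the base classifiers; ties go to the minority class."""
    votes = [knn_predict(s, x, n_neighbors) for s in subsets]
    return 1 if 2 * sum(votes) >= len(votes) else 0
```

Calling `fit_ekr(majority, minority)` with two lists of coordinate tuples returns the balanced subsets, and `predict_ekr(subsets, x)` classifies a new point `x`. A lazy learner such as k-NN keeps the sketch short because "training" amounts to storing each subset; any other base classifier could be substituted at that step.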
