A Classification Method Based on Feature Selection for Imbalanced Data

Imbalanced data are very common in the real world, and they can degrade the performance of conventional classification algorithms. To address the imbalanced classification problem, we propose an ensemble classification method that combines evolutionary under-sampling and feature selection. We apply the Bootstrap method to the original data to generate many sample subsets. A $V$-statistic is developed to measure the distribution of imbalanced data, and it also serves as the optimization objective of the genetic algorithm that under-samples the sample subsets. Moreover, we take the $F_{1}$ and Gmean indicators as two optimization objectives and employ a multiobjective ant colony optimization algorithm for feature selection on the resampled data to construct an ensemble system. Ten low-dimensional and four high-dimensional typical imbalanced datasets are used in the experiments. The proposed method is compared with six state-of-the-art algorithms on four performance measures. The experimental results show that the proposed system achieves better classification performance than the other algorithms, especially on high-dimensional imbalanced data.
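The two indicators used as feature-selection objectives can be illustrated with a minimal sketch. It assumes the standard binary definitions, with $F_{1}$ as the harmonic mean of precision and recall on the minority (positive) class and Gmean as the geometric mean of sensitivity and specificity. The abstract does not spell these formulas out, and the function names and toy labels below are hypothetical.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 when undefined."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(y_true, y_pred, positive=1):
    """Gmean = sqrt(sensitivity * specificity)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(sensitivity * specificity)


# Toy imbalanced example (assumed data): 2 minority vs. 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_score(y_true, y_pred)
gm = g_mean(y_true, y_pred)
print(f"accuracy={accuracy:.2f}  F1={f1:.2f}  Gmean={gm:.3f}")
```

In this toy case the classifier is 80% accurate while recovering only one of the two minority samples, which is why indicators such as $F_{1}$ and Gmean are commonly preferred over plain accuracy on imbalanced data.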
