Self-adaptive cost weights-based support vector machine cost-sensitive ensemble for imbalanced data classification

Abstract: Imbalanced data classification poses a major challenge for the data mining community. Although the standard support vector machine (SVM) is generally fairly robust on imbalanced data sets, it is an overall accuracy-oriented algorithm, which biases the final decision boundary in favor of the majority class. Ensemble methods have emerged as meta-techniques for improving the generalization performance of existing learning algorithms. In this paper, we propose a novel self-adaptive cost weights-based SVM cost-sensitive ensemble for imbalanced data classification. To keep the optimization objectives of the weak learners and the boosting scheme consistent, the proposed approach not only uses cost-sensitive SVMs as the base weak learners but also modifies the standard boosting scheme into a cost-sensitive one. To ensure that successive classifiers are trained on more minority instances, especially borderline minority instances, we also present a self-adaptive method for determining the sequential misclassification cost weights. At each boosting iteration, the method uses the previously obtained classifier to account for the different contributions of minority instances to the formation of the SVM classifier, which allows it to produce diverse classifiers and thus improve generalization performance. In the experiments, we analyze and discuss the effect of different parameters on performance and provide some suggestions for setting them. Extensive experimental results on different imbalanced datasets demonstrate that the proposed approach achieves better generalization performance in terms of G-Mean and F-Measure than other existing imbalanced dataset classification techniques.
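The abstract reports results in terms of G-Mean and F-Measure but does not define them. The sketch below assumes the commonly used definitions: G-Mean as the geometric mean of sensitivity and specificity, and F-Measure as the harmonic mean of precision and recall on the minority (positive) class. The function names and the toy labels are illustrative and are not taken from the paper.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall on the minority class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy data (assumed for illustration): 2 minority and 8 majority instances.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_majority = [0] * 10                       # 80% accurate, finds no minority
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]         # finds one of two minority
```

On this toy data, a classifier that always predicts the majority class reaches 80% accuracy yet scores zero on both metrics, which is why accuracy-oriented learners are usually evaluated with measures like these on imbalanced data.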
