Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data

The class imbalance problem occurs when the numbers of training instances belonging to different classes differ markedly. In this scenario, many traditional classifiers fail to provide adequate classification performance: the accuracy on the majority class is usually much higher than that on the minority class. In this article, we address the class imbalance problem by using a support vector machine (SVM) classifier with an optimized decision threshold adjustment strategy (SVM-OTHR), which answers an open question: how far should the classification hyperplane be moved towards the majority class? Specifically, the proposed strategy is self-adapting and finds the optimal moving distance of the classification hyperplane according to the actual distribution of the training samples. Furthermore, we extend the strategy to an ensemble version (EnSVM-OTHR) that can further improve classification performance. Both proposed algorithms are compared with many state-of-the-art classifiers on 30 skewed data sets from the KEEL data set repository, using two popular class imbalance evaluation metrics: F-measure and G-mean. Statistical analysis of the experimental results indicates their superiority.
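
The general idea of decision threshold adjustment can be illustrated with a minimal sketch. It is not the SVM-OTHR procedure itself, whose optimization rule is not given in this abstract. It assumes that an SVM has already produced real-valued decision scores for the training samples, that the minority class is labeled +1 and the majority class -1, and that a simple search over candidate thresholds is an acceptable stand-in for the optimized search. The names `predict`, `best_threshold`, and the toy scores are hypothetical.

```python
import math

def confusion(y_true, y_pred):
    """Count (tp, fp, tn, fn) with +1 as the minority (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return tp, fp, tn, fn

def g_mean(y_true, y_pred):
    """Geometric mean of minority recall and majority recall."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)

def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall on the minority class."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def predict(scores, threshold):
    """Label a sample +1 when its decision score exceeds the threshold."""
    return [1 if s > threshold else -1 for s in scores]

def best_threshold(scores, y_true, metric=g_mean):
    """Search midpoints between sorted scores; the default threshold 0 is the baseline."""
    s = sorted(set(scores))
    candidates = [s[0] - 1.0] + [(a + b) / 2 for a, b in zip(s, s[1:])] + [s[-1] + 1.0]
    best_thr, best_val = 0.0, metric(y_true, predict(scores, 0.0))
    for c in candidates:
        val = metric(y_true, predict(scores, c))
        if val > best_val:  # keep the baseline unless strictly improved
            best_thr, best_val = c, val
    return best_thr, best_val

# Toy decision scores (assumed): two minority samples fall on the wrong side of 0.
scores = [-2.0, -1.8, -1.5, -1.2, -1.0, -0.9, -0.3, -0.4, -0.2, 0.5]
labels = [-1, -1, -1, -1, -1, -1, -1, 1, 1, 1]
thr, val = best_threshold(scores, labels)
print(thr, val, g_mean(labels, predict(scores, 0.0)))
```

In this toy case the selected threshold is negative, which corresponds to moving the decision boundary towards the majority class so that more samples are assigned to the minority class. Passing `metric=f_measure` selects the threshold by F-measure instead.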
