Integration of feature vector selection and support vector machine for classification of imbalanced data

Abstract Support Vector Machines (SVMs) have been widely developed for classification problems. Imbalanced data arise in many practical classification problems, where the minority class is usually the one of interest. Undersampling is a popular remedy, but it risks discarding useful information in the original data. Tuning the SVM hyperparameters is also challenging. Based on an analysis of the geometrical meaning of kernel methods, this paper proposes an approach that combines a modified Feature Vector Selection (FVS) method with maximal between-class separability and an easy-to-tune version of SVM, namely Feature Vector Regression (FVR), proposed in our previous work. The modified FVS method selects a small number of data points that can linearly represent the whole dataset in the Reproducing Kernel Hilbert Space (RKHS) and that also give maximal separability of the imbalanced data in the RKHS. The FVR model is solved analytically, as in least-squares SVM. The decision threshold for classification is optimized to maximize a predefined accuracy metric. Twenty-six imbalanced datasets are considered, and the proposed method is compared with several SVM-based methods for imbalanced data. Statistical tests show the effectiveness of the proposed method.
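The threshold-optimization step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the accuracy metric, so the geometric mean of sensitivity and specificity (G-mean) is assumed here, along with labels in {-1, +1}, an exhaustive scan over midpoints between sorted scores, and the hypothetical helper names `g_mean` and `best_threshold`. The scores are made-up numbers standing in for real-valued classifier outputs.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (labels in {-1, +1})."""
    pos = y_true == 1
    neg = ~pos
    sensitivity = np.mean(y_pred[pos] == 1) if pos.any() else 0.0
    specificity = np.mean(y_pred[neg] == -1) if neg.any() else 0.0
    return float(np.sqrt(sensitivity * specificity))

def best_threshold(scores, y_true, metric=g_mean):
    """Scan candidate thresholds; return the one maximizing the metric."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    s = np.unique(scores)  # sorted distinct scores
    # Candidates: one value below all scores, midpoints, one value above all.
    candidates = np.concatenate(
        ([s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0])
    )
    best_t, best_v = float(candidates[0]), -1.0
    for t in candidates:
        y_pred = np.where(scores > t, 1, -1)  # predict +1 above the threshold
        v = metric(y_true, y_pred)
        if v > best_v:  # keep the first threshold reaching the maximum
            best_t, best_v = float(t), v
    return best_t, best_v

# Illustrative (made-up) scores: 6 majority (-1) and 2 minority (+1) samples.
scores = np.array([-1.2, -0.9, -0.8, -0.6, -0.5, -0.4, -0.3, 0.1])
labels = np.array([-1, -1, -1, -1, -1, 1, -1, 1])

default_v = g_mean(labels, np.where(scores > 0.0, 1, -1))
t, v = best_threshold(scores, labels)
print(f"threshold 0.0: G-mean = {default_v:.3f}")
print(f"threshold {t:.2f}: G-mean = {v:.3f}")
```

In this toy case the default threshold of zero misses one of the two minority samples, while the shifted threshold recovers both at the cost of one majority-class error. In practice the threshold would be chosen on training or validation data rather than on the test set.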
