Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for SVM classification

Abstract In this work, we propose a novel feature selection approach designed to deal with two major issues in machine learning, namely class-imbalance and high dimensionality. The proposed embedded strategy penalizes the cardinality of the feature set via the scaling factors technique, and is used with two support vector machine (SVM) formulations designed to deal with the class-imbalanced problem, namely Cost Sensitive SVM, and Support Vector Data Description. The proposed concave formulations are solved via a Quasi-Newton update and Armijo line search. We performed experiments on 12 highly imbalanced microarray datasets using linear and Gaussian kernel, achieving the highest average predictive performance with our approach compared with the most well-known feature selection strategies.

[1]  Ali Hamzeh,et al.  DBFS: An effective Density Based Feature Selection scheme for small sample size and high dimensional imbalanced data sets , 2012, Data Knowl. Eng..

[2]  Taghi M. Khoshgoftaar,et al.  Comparison of approaches to alleviate problems with high-dimensional and class-imbalanced data , 2011, 2011 IEEE International Conference on Information Reuse & Integration.

[3]  David E. Misek,et al.  Gene-expression profiles predict survival of patients with lung adenocarcinoma , 2002, Nature Medicine.

[4]  Sayan Mukherjee,et al.  Choosing Multiple Parameters for Support Vector Machines , 2002, Machine Learning.

[5]  T. Golub,et al.  Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. , 2003, Cancer research.

[6]  Dongjoon Kong,et al.  A New Feature Selection Method for One-Class Classification Problems , 2012, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[7]  Yves Grandvalet,et al.  Adaptive Scaling for Feature Selection in SVMs , 2002, NIPS.

[8]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[9]  Stephen J. Wright,et al.  Numerical Optimization , 2018, Fundamental Statistical Inference.

[10]  Xue-wen Chen,et al.  FAST: a roc-based feature selection metric for small samples and imbalanced data classification problems , 2008, KDD.

[11]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[12]  Ping Zhong,et al.  A new one-class SVM based on hidden information , 2014, Knowl. Based Syst..

[13]  J. Mercer Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations , 1909 .

[14]  Slobodan Vucetic,et al.  BudgetedSVM: a toolbox for scalable SVM approximations , 2013, J. Mach. Learn. Res..

[15]  Bernhard Schölkopf,et al.  Learning with kernels , 2001 .

[16]  Rok Blagus,et al.  Class prediction for high-dimensional class-imbalanced data , 2010, BMC Bioinformatics.

[17]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[18]  Bart Baesens Analytics in a Big Data World , 2014 .

[19]  Alain Rakotomamonjy,et al.  Variable Selection Using SVM-based Criteria , 2003, J. Mach. Learn. Res..

[20]  Julio López,et al.  A multi-class SVM approach based on the l1-norm minimization of the distances between the reduced convex hulls , 2015, Pattern Recognit..

[21]  Nitesh V. Chawla,et al.  Editorial: special issue on learning from imbalanced data sets , 2004, SKDD.

[22]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[23]  E. Lander,et al.  Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[24]  Changyin Sun,et al.  Support vector machine-based optimized decision threshold adjustment strategy for classifying imbalanced data , 2015, Knowl. Based Syst..

[25]  Kevin N. Gurney,et al.  An introduction to neural networks , 2018 .

[26]  Taghi M. Khoshgoftaar,et al.  Feature Selection with High-Dimensional Imbalanced Data , 2009, 2009 IEEE International Conference on Data Mining Workshops.

[27]  S. Holm A Simple Sequentially Rejective Multiple Test Procedure , 1979 .

[28]  Bernhard Schölkopf,et al.  Estimating the Support of a High-Dimensional Distribution , 2001, Neural Computation.

[29]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[30]  Richard Weber,et al.  Simultaneous feature selection and classification using kernel-penalized support vector machines , 2011, Inf. Sci..

[31]  Jianzhong Li,et al.  A stable gene selection in microarray data analysis , 2006, BMC Bioinformatics.

[32]  Javier Pérez-Rodríguez,et al.  Class imbalance methods for translation initiation site recognition in DNA sequences , 2012, Knowl. Based Syst..

[33]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[34]  Ru-Sheng Liu,et al.  Pattern classification in DNA microarray data of multiple tumor types , 2006, Pattern Recognit..

[35]  Robert P. W. Duin,et al.  Support Vector Data Description , 2004, Machine Learning.

[36]  Eric Horvitz,et al.  Considering Cost Asymmetry in Learning Classifiers , 2006, J. Mach. Learn. Res..

[37]  Rohini K. Srihari,et al.  Feature selection for text categorization on imbalanced data , 2004, SKDD.

[38]  David G. Stork,et al.  Pattern Classification , 1973 .

[39]  Tao Li,et al.  Cost-sensitive feature selection using random forest: Selecting low-cost subsets of informative features , 2016, Knowl. Based Syst..

[40]  José Salvador Sánchez,et al.  On the effectiveness of preprocessing methods when dealing with different levels of class imbalance , 2012, Knowl. Based Syst..

[41]  Richard Weber,et al.  Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines , 2014, Inf. Sci..

[42]  Xue-wen Chen,et al.  Combating the Small Sample Class Imbalance Problem Using Feature Selection , 2010, IEEE Transactions on Knowledge and Data Engineering.

[43]  Paul S. Bradley,et al.  Feature Selection via Concave Minimization and Support Vector Machines , 1998, ICML.

[44]  Gabriele Steidl,et al.  Combined SVM-Based Feature Selection and Classification , 2005, Machine Learning.

[45]  Le Song,et al.  Feature Selection via Dependence Maximization , 2012, J. Mach. Learn. Res..

[46]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[47]  Masoud Nikravesh,et al.  Feature Extraction - Foundations and Applications , 2006, Feature Extraction.

[48]  J. Welsh,et al.  Molecular classification of human carcinomas by use of gene expression signatures. , 2001, Cancer research.

[49]  R. Tibshirani,et al.  Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. , 2004, The New England journal of medicine.

[50]  Verónica Bolón-Canedo,et al.  Recent advances and emerging challenges of feature selection in the context of big data , 2015, Knowl. Based Syst..

[51]  M. J. D. Powell,et al.  A fast algorithm for nonlinearly constrained optimization calculations , 1978 .