Feature selection for high dimensional imbalanced class data based on F-measure optimization

Feature selection is designed to eliminate redundant attributes and improve classification accuracy. This is a challenging problem, especially for imbalanced data. Traditional feature selection methods ignore class imbalance, biasing the selected features towards the majority class and overlooking features that are significant for the minority class. Because of the advantage of the F-measure in imbalanced data classification, we propose to use the F-measure rather than accuracy as the optimization target of the feature selection algorithm. This paper introduces a novel feature selection method, SSVM-FS, based on a structural support vector machine classifier that optimizes the F-measure. Features are selected according to the weight vector of the SSVM, which takes the class imbalance problem into account. Building on this, we develop a comprehensive feature ranking method that integrates the SSVM weight vector with symmetric uncertainty. The comprehensive score is used to reduce the feature set to a suitable size, and a harmony search is then applied to find the optimal combination of features for predicting the target class label. The feature subset selected by the proposed method represents both the majority and the minority class and is less redundant. Experimental results on six high-dimensional, class-imbalanced microarray data sets show that the proposed method handles imbalanced classification better.
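To make the ranking step concrete, the sketch below illustrates the kind of comprehensive score the abstract describes: combining the magnitude of a linear SVM weight vector with symmetric uncertainty, then keeping the top-scoring features as the reduced set that a search procedure would later refine. This is not the authors' implementation: a class-weighted LinearSVC stands in for the F-measure-optimizing structural SVM, binary labels are assumed, the mixing rule (a min-max-normalized weighted sum with an illustrative alpha parameter) and the helper names (symmetric_uncertainty, comprehensive_scores, top_k_features) are assumptions, and the final harmony-search stage is omitted.

```python
# Minimal sketch of a comprehensive feature-ranking score, assuming:
# - a class-weighted LinearSVC as a stand-in for the F-measure SSVM,
# - binary integer-coded labels,
# - an illustrative normalized weighted sum as the combination rule.
import numpy as np
from sklearn.svm import LinearSVC


def _entropy(codes):
    """Shannon entropy (in bits) of a vector of discrete codes."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def symmetric_uncertainty(x, y, bins=10):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); the feature x is discretized into bins."""
    edges = np.histogram_bin_edges(x, bins=bins)
    x_disc = np.digitize(x, edges[1:-1])
    _, y_codes = np.unique(y, return_inverse=True)
    h_x, h_y = _entropy(x_disc), _entropy(y_codes)
    # Joint entropy via a collision-free pairing of the two code vectors.
    joint = x_disc * (y_codes.max() + 1) + y_codes
    mi = h_x + h_y - _entropy(joint)
    return 2.0 * mi / (h_x + h_y) if (h_x + h_y) > 0 else 0.0


def comprehensive_scores(X, y, alpha=0.5):
    """Mix |w_j| from a class-weighted linear SVM with SU_j for every feature j."""
    svm = LinearSVC(class_weight="balanced", C=1.0, max_iter=10000).fit(X, y)
    w = np.abs(svm.coef_).ravel()          # one weight per feature (binary case)
    su = np.array([symmetric_uncertainty(X[:, j], y) for j in range(X.shape[1])])

    def norm(v):  # min-max normalize each criterion before mixing
        return (v - v.min()) / (np.ptp(v) + 1e-12)

    return alpha * norm(w) + (1.0 - alpha) * norm(su)


def top_k_features(X, y, k=200):
    """Indices of the k highest-scoring features (the pre-filter before the search stage)."""
    return np.argsort(comprehensive_scores(X, y))[::-1][:k]
```

In this sketch the class_weight="balanced" option is what injects imbalance awareness into the weight vector; in the paper this role is played by the structural SVM trained to optimize the F-measure directly.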
