An embedded feature selection method for imbalanced data classification

Imbalanced datasets frequently arise in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy of identifying the minority class is a critically important issue. Feature selection is one way to address it: an effective feature selection method chooses a subset of features that favors accurate identification of the minority class. A decision tree is a classifier that can be built with different splitting criteria, and its advantage is the ease of detecting which feature is used at each splitting node. Thus, a decision-tree splitting criterion can serve as a feature selection method. This paper proposes an embedded feature selection method based on a weighted Gini index (WGI). Comparisons with the Chi2, F-statistic, and Gini index feature selection methods show that F-statistic and Chi2 reach the best performance when only a few features are selected; as the number of selected features increases, the proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results on two datasets show that ROC AUC performance can be high even if only a few features are selected and used, and changes only slightly as more and more features are selected. However, F-measure achieves excellent performance only if 20% or more of the features are chosen. These results help practitioners select a proper feature selection method when facing a practical problem.
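The embedded approach described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: the paper's weighted Gini index (WGI) definition is not reproduced here, so as a stand-in the sketch weights classes inversely to their frequency (scikit-learn's `class_weight="balanced"`), which biases the standard Gini splitting criterion toward the minority class. Features are then ranked by the fitted tree's impurity-based importances, and a top-k subset is evaluated with ROC AUC and F-measure, the same criteria used in the paper.

```python
# Sketch: embedded feature selection via class-weighted decision-tree (Gini)
# importances. The class weighting is an assumption standing in for the
# paper's weighted Gini index (WGI), which is not specified in the abstract.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score, f1_score

# Imbalanced toy data: roughly 5% minority class, 20 candidate features.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Embedded selection: fit a class-weighted Gini tree, then rank features
# by how much each one reduces impurity across its splitting nodes.
ranker = DecisionTreeClassifier(criterion="gini", class_weight="balanced",
                                random_state=0).fit(X_tr, y_tr)
order = np.argsort(ranker.feature_importances_)[::-1]

# Keep the top-k features and evaluate with ROC AUC and F-measure.
k = 5
top = order[:k]
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X_tr[:, top], y_tr)
proba = clf.predict_proba(X_te[:, top])[:, 1]
pred = clf.predict(X_te[:, top])
print("ROC AUC:", round(roc_auc_score(y_te, proba), 3),
      "F-measure:", round(f1_score(y_te, pred), 3))
```

Sweeping `k` from 1 to the full feature count reproduces the kind of comparison reported in the abstract: ROC AUC tends to plateau early, while F-measure typically needs a larger fraction of the features before it stabilizes.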
