Ensemble learning-based filter-centric hybrid feature selection framework for high-dimensional imbalanced data

Abstract Feature selection for high-dimensional imbalanced data has attracted considerable attention in recent years. The filter-wrapper hybrid method, a conventional approach to feature selection for high-dimensional data, aims to reduce excessive computational time. Ensemble learning-based feature selection, by contrast, focuses on discovering robust features and accepts a high level of computational complexity to do so. Because one approach targets low cost and the other trades cost for robustness, combining them is not easy. However, a combined method is essential to advancing machine learning research that addresses real-world problems. We propose a filter-centric hybrid method based on ensemble learning that can select the best feature subset for high-dimensional imbalanced data. The basic concept of the proposed method is to design a feature evaluation scheme based on the filter method and to apply ensemble learning within a reasonable computational time. To achieve this objective, the method uses the predictions produced by multiple classifiers as inputs to the feature evaluation function. As a result, the evaluation reflects the predictive performance of the classifiers, which addresses the low predictive performance often seen in features selected by filter methods, and it identifies robust features at the same time. To demonstrate the superiority of the proposed method, we perform experiments on 14 datasets comprising low-dimensional balanced, high-dimensional balanced, and high-dimensional imbalanced data, and we compare the proposed method with state-of-the-art feature selection methods.
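The central idea, feeding the predictions of multiple classifiers into a filter-style feature evaluation function, can be illustrated with a minimal sketch. The abstract does not specify the evaluation function, the base classifiers, or the resampling scheme, so everything below is an assumption made for illustration: categorical naive Bayes as the base classifier, bootstrap resampling to build the ensemble, and mutual information between each feature and the ensemble's predictions as the filter score. The function names are hypothetical and this is not the authors' implementation.

```python
import math
import random
from collections import Counter, defaultdict


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def fit_naive_bayes(rows, labels):
    """Assumed base classifier: categorical naive Bayes with add-one smoothing."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    values = defaultdict(set)            # feature index -> values seen in training
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(label, j)][v] += 1
            values[j].add(v)

    def predict(row):
        def log_posterior(c):
            total = math.log(class_counts[c] / len(labels))
            for j, v in enumerate(row):
                smoothed = (value_counts[(c, j)][v] + 1) / (class_counts[c] + len(values[j]))
                total += math.log(smoothed)
            return total
        return max(sorted(class_counts), key=log_posterior)

    return predict


def ensemble_filter_scores(rows, labels, n_classifiers=5, seed=0):
    """Score each feature by its mean mutual information with classifier predictions."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    scores = [0.0] * d
    for _ in range(n_classifiers):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap resample (assumption)
        predict = fit_naive_bayes([rows[i] for i in sample], [labels[i] for i in sample])
        preds = [predict(row) for row in rows]
        for j in range(d):
            column = [row[j] for row in rows]
            scores[j] += mutual_information(column, preds) / n_classifiers
    return scores


# Toy data: feature 0 copies the label, feature 1 is independent of it.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 10
labels = [row[0] for row in rows]
scores = ensemble_filter_scores(rows, labels)
```

In this toy setup the informative feature should receive a higher score than the independent one. The per-feature cost is a single pass over the data for each classifier, which is the sense in which such a scheme stays filter-like rather than retraining a model for every candidate subset as a wrapper would.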
