A Framework for Feature Selection to Exploit Feature Group Structures

Filter feature selection methods play an important role in machine learning tasks where low computational cost, classifier independence, or simplicity is important. Existing filter methods focus predominantly on the input data alone and do not exploit external sources of information about correlations within feature groups to improve classification accuracy. We propose a framework that enables supervised filter feature selection methods to exploit feature group information from external sources of knowledge. We use this framework to incorporate feature group information into the minimum Redundancy Maximum Relevance (mRMR) algorithm, resulting in the GroupMRMR algorithm. We show that GroupMRMR achieves high accuracy gains over mRMR (up to ~35%) and over other popular filter methods (up to ~50%). GroupMRMR has the same computational complexity as mRMR and therefore incurs no additional computational cost. The proposed method has many real-world applications, particularly those that use genomic, text, and image data, whose features demonstrate strong group structures.
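
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard greedy mRMR criterion in its difference form (relevance minus mean redundancy) on discrete features. It is not the GroupMRMR algorithm, whose group-aware scoring is not specified in this abstract. The function names `mutual_information` and `mrmr`, the toy feature names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # Sum over observed (x, y) pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def mrmr(features, labels, k):
    """Greedy mRMR, difference form: pick the feature that maximizes
    relevance to the labels minus mean redundancy with already selected features.

    `features` maps a (hypothetical) feature name to its column of discrete values.
    """
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected, remaining = [], list(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (illustrative only): the class is determined jointly by a and b;
# a_copy duplicates a and is therefore fully redundant with it.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
labels = [x + 2 * y for x, y in zip(a, b)]
features = {"a": a, "a_copy": list(a), "b": b}

print(mrmr(features, labels, 2))
```

In this toy setting all three features are equally relevant to the label, so the redundancy term is what steers the second pick away from the duplicate `a_copy` and toward `b`. A group-aware variant would presumably bring external group membership into this scoring step, but how GroupMRMR does so is not described here.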
