An evaluation of classifier-specific filter measure performance for feature selection

Feature selection is an important part of classifier design. There are many possible methods for searching and evaluating feature subsets, but little consensus on which methods are best. This paper examines a number of filter-based feature subset evaluation measures with the goal of assessing their performance with respect to specific classifiers. This work tests 16 common filter measures for use with K-nearest neighbors and support vector machine classifiers. The measures are tested on 20 real and 20 artificial data sets, which are designed to probe for specific challenges. The strengths and weaknesses of each measure are discussed with respect to the specific challenges and correlation with classifier accuracy. The results highlight several challenging problems with a number of common filter measures.

The results indicate that the best filter measure is classifier-specific. K-nearest neighbors classifiers work well with subset-based RELIEF, correlation feature selection, or conditional mutual information maximization, whereas Fisher's interclass separability criterion and conditional mutual information maximization work better for support vector machines. Despite the large number and variety of feature selection measures proposed in the literature, no single measure is guaranteed to outperform the others, even for a single classifier, and the overall performance of a feature selection method cannot be characterized independently of the subsequent classifier.

Highlights

- Compares common feature selection filter measures for use with specific classifiers.
- Many tested filter measures do not reliably predict classifier accuracy.
- Some measures have specific problems that cause them to select unsuitable features.
- The best feature selection filter measure is classifier-specific.
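The paper does not include code; the following is a minimal sketch of the filter-then-classify protocol the abstract describes, using mutual information as a stand-in filter measure and scikit-learn's KNN and SVM classifiers. The data set, the choice of mutual information as the ranking score, and the subset sizes are assumptions for illustration, not the paper's exact experimental setup.

```python
# Illustrative sketch (not the paper's protocol): score features with a
# classifier-independent filter measure, then check how well subsets built
# from that ranking perform with two different classifiers.
import numpy as np
from sklearn.datasets import load_breast_cancer            # stand-in data set
from sklearn.feature_selection import mutual_info_classif  # stand-in filter measure
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Filter step: score every feature without reference to any classifier.
scores = mutual_info_classif(X, y, random_state=0)
ranked = np.argsort(scores)[::-1]  # best-scoring features first

# Evaluation step: grow the feature subset along the ranking and record
# cross-validated accuracy for each classifier separately.
for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC(kernel="rbf", gamma="scale"))]:
    for k in (1, 5, 10, X.shape[1]):
        subset = ranked[:k]
        model = make_pipeline(StandardScaler(), clf)
        acc = cross_val_score(model, X[:, subset], y, cv=5).mean()
        print(f"{name:3s}  top-{k:2d} features  accuracy = {acc:.3f}")
```

In the paper this simple ranking loop is replaced by the 16 subset-based filter measures and a search over candidate subsets; the sketch only illustrates the separation between the classifier-independent filter score and the classifier-dependent accuracy check that the study correlates.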
