A comprehensive study of eleven feature selection algorithms and their impact on text classification

Feature selection is routinely used as a preprocessing step to remove irrelevant features and mitigate the “curse of dimensionality”. In contrast to dimensionality reduction techniques such as PCA, the features produced by feature selection are drawn from the original feature space and are therefore easy to interpret. A large number of feature selection algorithms have been proposed in the literature, which raises a practical question: which algorithm should one use, and how does a given feature selection method affect the performance of a given classification algorithm? This paper addresses these questions by (1) presenting an open-source software system that integrates eleven feature selection algorithms and five common classifiers, and (2) systematically comparing and evaluating the selected features and their impact on these five classifiers using five datasets. Specifically, the system includes ten commonly adopted filter-based feature selection algorithms: Chi-square, Information Gain, Fisher Score, Gini Index, Kruskal-Wallis, Laplacian Score, ReliefF, FCBF, CFS, and mRMR. It also includes one state-of-the-art embedded approach built upon Random Forests. The five classifiers are SVM, Random Forests, Naïve Bayes, kNN, and the C4.5 decision tree. Comprehensive evaluations consisting of roughly 1,000 experiments were conducted over five text datasets. Several approximately equivalent groups (AEGs), in which the algorithms of a group select highly similar feature sets, were identified; for example, Chi-square and Information Gain form an AEG. Suitable feature-selection-classifier combinations were also identified: for instance, Gini Index or Kruskal-Wallis combined with SVM often yields classification performance comparable with or better than using all of the original features. These results provide empirical guidelines for the data analytics community. The software system is available at https://www.dropbox.com/sh/ryw23s52e98uhrv/AAANpc0JU4x6r3Sfv4qB5ERna?dl=0
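
To make the kind of comparison described above concrete, the sketch below (not the authors' released system) uses scikit-learn on a small two-class slice of the 20 Newsgroups text corpus, assuming internet access to download it. It ranks features with Chi-square and with a mutual-information score (an information-gain-style criterion), measures the Jaccard overlap of the two selected feature sets, which is one simple way to probe whether two criteria behave as an approximately equivalent group, and then compares an SVM trained on the selected features with one trained on all features. The dataset choice, vectorizer, and k=500 are illustrative assumptions, not values from the paper.

```python
# Minimal illustrative sketch: compare two filter criteria and their effect on an SVM.
# Assumptions (not from the paper): 20 Newsgroups, count features, k = 500.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2, mutual_info_classif, SelectKBest
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Load a small two-class text problem to keep the run fast.
data = fetch_20newsgroups(subset="train",
                          categories=["sci.med", "rec.autos"],
                          remove=("headers", "footers", "quotes"))
X = CountVectorizer(max_features=5000).fit_transform(data.data)
y = data.target

k = 500  # number of features to keep

# Rank features by each criterion and take the top k.
chi2_scores, _ = chi2(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True, random_state=0)
top_chi2 = set(np.argsort(chi2_scores)[-k:])
top_mi = set(np.argsort(mi_scores)[-k:])

# Jaccard overlap of the two selected feature sets: a high value suggests the two
# criteria select highly similar features on this dataset (an AEG-like behavior).
jaccard = len(top_chi2 & top_mi) / len(top_chi2 | top_mi)
print(f"Chi-square vs. mutual information top-{k} overlap (Jaccard): {jaccard:.2f}")

# Compare an SVM on all features with an SVM on the Chi-square-selected subset.
clf = LinearSVC()
acc_all = cross_val_score(clf, X, y, cv=5).mean()
X_sel = SelectKBest(chi2, k=k).fit_transform(X, y)
acc_sel = cross_val_score(clf, X_sel, y, cv=5).mean()
print(f"5-fold accuracy, all {X.shape[1]} features: {acc_all:.3f}")
print(f"5-fold accuracy, top {k} chi2 features: {acc_sel:.3f}")
```

The same pattern extends to any pair of scoring functions and any downstream classifier; sweeping k and repeating the overlap and accuracy measurements is the straightforward way to reproduce a small-scale version of the paper's feature-selection-classifier comparison.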
