Feature Selection Filters Based on the Permutation Test

We investigate the problem of supervised feature selection within the filtering framework. In our approach, applicable to two-class problems, a feature's strength is inversely proportional to the p-value of the null hypothesis that its class-conditional densities, p(X|Y = 0) and p(X|Y = 1), are identical. To estimate the p-values, we use Fisher's permutation test combined with four simple filtering criteria in the role of test statistics: sample mean difference, symmetric Kullback-Leibler distance, information gain, and the chi-square statistic. The experimental results of our study, performed using the naive Bayes classifier and support vector machines, strongly indicate that the permutation test improves the above-mentioned filters and can be used effectively when the sample size is relatively small and the number of features relatively large.
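The per-feature scoring described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the absolute sample mean difference as the test statistic (the simplest of the four criteria mentioned) and a Monte Carlo approximation of Fisher's permutation test, with a hypothetical helper name `permutation_pvalue`:

```python
import random

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Estimate the p-value of the null hypothesis that a feature's
    class-conditional distributions are identical, by randomly
    permuting the class labels y over the feature values x.

    Test statistic: absolute difference of class-conditional means
    (one of the four filtering criteria discussed in the paper).
    """
    rng = random.Random(seed)

    def mean_diff(labels):
        g0 = [v for v, lab in zip(x, labels) if lab == 0]
        g1 = [v for v, lab in zip(x, labels) if lab == 1]
        return abs(sum(g0) / len(g0) - sum(g1) / len(g1))

    observed = mean_diff(y)
    count = 0
    for _ in range(n_perm):
        perm = y[:]
        rng.shuffle(perm)  # shuffling preserves the class sizes
        if mean_diff(perm) >= observed:
            count += 1
    # Add-one correction keeps the Monte Carlo estimate strictly positive.
    return (count + 1) / (n_perm + 1)
```

A filter would compute this p-value for every feature and rank features in ascending order of p-value, so that a small p-value marks a strong feature regardless of which underlying statistic is plugged in.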
