Feature subset selection using a new definition of classifiability

The performance of most practical classifiers improves when correlated or irrelevant features are removed. Machine-based classification is thus often preceded by subset selection, a procedure that identifies the relevant features of a high-dimensional data set. At present, the most widely used subset selection technique is the so-called "wrapper" approach, in which a search algorithm identifies candidate subsets and the actual classifier is used as a "black box" to evaluate the fitness of each subset. Fitness evaluation, however, requires cross-validation or another resampling-based procedure for error estimation, necessitating the construction of a large number of classifiers for each candidate subset. This significant computational burden makes the wrapper approach impractical when a large number of features is present. In this paper, we present an approach to subset selection based on a novel definition of the classifiability of a given data set. The classifiability measure we propose characterizes the relative ease with which labeled data can be classified. We use this measure to systematically add the feature that most increases classifiability. The proposed approach does not require the construction of classifiers at each step and therefore does not incur as high a computational burden as the wrapper approach. Our results over several data sets indicate that the subsets obtained are at least as good as those obtained with the wrapper approach.
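The greedy forward-selection loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `classifiability` function here is a hypothetical stand-in (a nearest-neighbor label-agreement proxy), since the paper's actual classifiability measure is not reproduced in this abstract. The key structural point it illustrates is that each step scores candidate subsets directly, with no classifier trained per subset.

```python
import numpy as np

def classifiability(X, y):
    """Hypothetical proxy for classifiability: the fraction of points
    whose nearest neighbor (Euclidean, excluding self) shares their
    label. The paper's actual measure differs; this is illustrative."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-matches
    nn = d.argmin(axis=1)                # index of each point's nearest neighbor
    return float((y[nn] == y).mean())

def greedy_forward_selection(X, y, k):
    """Add, one feature at a time, the feature whose inclusion most
    increases the classifiability score. No classifier is built at
    any step, in contrast with the wrapper approach."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        # Score every candidate one-feature extension of the current subset.
        scores = [(classifiability(X[:, selected + [f]], y), f)
                  for f in remaining]
        _, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```

On a toy data set where feature 0 separates the classes and feature 1 is noise, the loop picks feature 0 first, since including it yields the larger score.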
