An Introduction to Variable and Feature Selection

Variable and feature selection have become the focus of much research in areas of application for which datasets with tens or hundreds of thousands of variables are available. These areas include text processing of internet documents, gene expression array analysis, and combinatorial chemistry. The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data. The contributions of this special issue cover a wide range of aspects of such problems: providing a better definition of the objective function, feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods.
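As a minimal illustration of one of the aspects listed above, feature ranking, the sketch below scores each variable independently by the absolute value of its Pearson correlation with the target, a simple univariate criterion of the kind discussed in this literature. The function name and the synthetic data are assumptions made for the example only, not part of the special issue.

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Rank features by the absolute Pearson correlation of each
    column of X with the target y (a univariate ranking criterion)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center the data so the dot products below compute covariances.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of each feature with the target.
    scores = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    )
    order = np.argsort(-np.abs(scores))  # best-scoring features first
    return order, np.abs(scores)

# Synthetic example: only the first two of ten features are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.normal(size=100)
order, scores = rank_features_by_correlation(X, y)
print(order[:3])  # the informative features should appear near the top
```

Such a ranking is fast and easy to interpret, but because each feature is scored in isolation it can miss variables that are useful only in combination, which is precisely why the multivariate selection and search methods mentioned above are also covered.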
