Non-monotonic feature selection

We consider the problem of selecting the m most informative features, where the number of features m is specified in advance. This is essentially a combinatorial optimization problem and is usually solved approximately. Conventional feature selection methods address the computational challenge in two steps: (a) rank all features by scores that are typically computed independently of the specified number of features m, and (b) select the top m ranked features. A major shortcoming of these approaches is that a feature f chosen when m features are requested will always be chosen when more than m features are requested. We refer to this as the "monotonic" property of feature selection. In this work, we argue that it is important to develop efficient algorithms for non-monotonic feature selection. To this end, we develop an algorithm for non-monotonic feature selection that approximates the underlying combinatorial optimization problem by a Multiple Kernel Learning (MKL) problem. We also present a strategy for deriving a discrete solution from the approximate MKL solution, and establish a performance guarantee for this discrete solution relative to the global optimum of the combinatorial problem. An empirical study on a number of benchmark data sets shows that the proposed framework compares favorably with several state-of-the-art feature selection approaches.

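To make the idea concrete, the following Python sketch illustrates one way such a relaxation can be set up: each feature contributes its own linear base kernel, the relaxed selection variables are constrained by the budget m (it is this dependence on m that breaks monotonicity), and a discrete solution is recovered by keeping the m largest weights. The projected-gradient heuristic, the capped-simplex projection, and the top-m rounding below are illustrative assumptions, not the exact formulation or the performance-guaranteed rounding strategy described in the abstract.

```python
import numpy as np
from sklearn.svm import SVC


def project_capped_simplex(v, m):
    """Euclidean projection of v onto {x : 0 <= x_j <= 1, sum_j x_j = m}.

    Bisection on the shift tau in x_j = clip(v_j - tau, 0, 1). Tying the
    relaxed selection variables to the budget m is what makes the relaxed
    solution depend on m, in contrast to score-and-rank methods.
    """
    lo, hi = v.min() - 1.0, v.max()
    for _ in range(100):
        tau = 0.5 * (lo + hi)
        if np.clip(v - tau, 0.0, 1.0).sum() > m:
            lo = tau  # total mass still above the budget: shift further
        else:
            hi = tau
    return np.clip(v - 0.5 * (lo + hi), 0.0, 1.0)


def mkl_feature_selection(X, y, m, C=1.0, n_iter=30, step=0.1):
    """Pick m feature indices via per-feature kernel weights (illustrative)."""
    n, d = X.shape
    base = [np.outer(X[:, j], X[:, j]) for j in range(d)]  # one linear kernel per feature
    mu = np.full(d, m / d)                                  # relaxed selection variables
    for _ in range(n_iter):
        K = sum(w * Kj for w, Kj in zip(mu, base))          # weighted kernel combination
        svm = SVC(kernel="precomputed", C=C).fit(K, y)      # y is assumed binary
        ay = np.zeros(n)
        ay[svm.support_] = svm.dual_coef_.ravel()           # alpha_i * y_i at the SVM optimum
        # gradient of the SVM dual objective with respect to each kernel weight
        grad = np.array([-0.5 * (ay @ Kj @ ay) for Kj in base])
        mu = project_capped_simplex(mu - step * grad, m)    # projected gradient step
    return np.argsort(mu)[::-1][:m]                         # simple rounding: keep top-m weights


if __name__ == "__main__":
    # toy usage on a synthetic binary task
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
    print(mkl_feature_selection(X, y, m=5))
```

Note that because the capped-simplex constraint changes with m, the learned weights (and hence the selected subset) can differ across budgets, unlike a fixed score ranking.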