The Feature Selection Path in Kernel Methods

We examine the problem of automatic feature selection and weighting in kernel methods. We study a formulation that simultaneously optimizes the feature weights and the parameters of the kernel model, using L1 regularization to enforce feature selection. For quite general choices of kernel, we prove that this problem admits a unique regularization path, running from 0 to a stationary point of the non-regularized problem. We propose an ODE-based homotopy method to follow this trajectory. By tracking the path, our algorithm automatically discards irrelevant features and can move back and forth along the path to avoid local optima. Experiments on synthetic and real datasets show that the method achieves low prediction error and efficiently separates relevant from irrelevant features.
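
The abstract's core computational idea, following the regularization path by integrating an ODE derived from the stationarity conditions, can be illustrated on a much simpler problem. The sketch below is a minimal, hedged example and not the paper's kernel formulation: it traces the path of a linear least-squares fit under a smoothed L1 penalty, where differentiating the stationarity condition with respect to the regularization parameter yields the path ODE. The data, the smoothing constant EPS, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative toy data (not from the paper): 5 features, only 2 relevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 5))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(60)

EPS = 1e-4  # assumed smoothing constant: |w| is replaced by sqrt(w^2 + EPS)

def path_ode(lam, w):
    """dw/dlam, obtained by differentiating the stationarity condition
    X^T (X w - y) + lam * p'(w) = 0 with respect to lam."""
    p1 = w / np.sqrt(w**2 + EPS)          # gradient of the smoothed L1 penalty
    p2 = EPS / (w**2 + EPS) ** 1.5        # its (diagonal) Hessian
    H = X.T @ X + lam * np.diag(p2)       # Hessian of the full objective
    return np.linalg.solve(H, -p1)        # implicit-function-theorem step

# Start at lam = 0 from a stationary point of the non-regularized problem,
# then integrate the path outward; irrelevant weights shrink toward zero.
w0 = np.linalg.lstsq(X, y, rcond=None)[0]
sol = solve_ivp(path_ode, (0.0, 200.0), w0, dense_output=True, rtol=1e-8)

for lam in (0.0, 5.0, 50.0, 200.0):
    print(f"lam={lam:6.1f}  w={np.round(sol.sol(lam), 3)}")
```

Printing the weights at several points along the path shows the two relevant coefficients persisting while the irrelevant ones are driven toward zero as the regularization grows, mirroring on a small scale the feature-discarding behavior the paper describes.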
