High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning

We consider the problem of high-dimensional non-linear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize non-linear interactions between the original variables. To select efficiently from these many kernels, we use the natural hierarchical structure of the problem to extend the multiple kernel learning framework to kernels that can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a graph-adapted sparsity-inducing norm, in polynomial time in the number of selected kernels. Moreover, we study the consistency of variable selection in high-dimensional settings, showing that under certain assumptions, our regularization framework allows a number of irrelevant variables which is exponential in the number of observations. Our simulations on synthetic datasets and datasets from the UCI repository show state-of-the-art predictive performance for non-linear regression problems.
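A minimal sketch of the kind of graph-adapted norm described above, assuming the standard hierarchical-group construction (the notation V, D(v), d_v, f_v is ours for illustration, not quoted from the paper): each vertex v of the DAG indexes one kernel with predictor component f_v in its feature space, D(v) denotes the set of descendants of v (including v itself), and d_v > 0 is a per-vertex weight. The penalty then takes the form

    \Omega(f) = \sum_{v \in V} d_v \Big( \sum_{w \in D(v)} \lVert f_w \rVert^2 \Big)^{1/2}

Because each term couples a vertex with all of its descendants, a component f_w can be nonzero only if every group containing it is active, so the set of selected kernels is closed under taking ancestors in the DAG. This closure property is what makes an active-set strategy plausible whose cost grows with the number of selected kernels rather than with the exponentially many candidates.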
