Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels

We consider regularized risk minimization in a large dictionary of reproducing kernel Hilbert spaces (RKHSs) over which the target function has a sparse representation. This setting, commonly referred to as sparse Multiple Kernel Learning (MKL), may be viewed as the non-parametric extension of group sparsity in linear models. While the two dominant algorithmic strands of sparse learning, namely convex relaxations using the ℓ1 norm (e.g., Lasso) and greedy methods (e.g., OMP), have both been rigorously extended to group sparsity, the sparse MKL literature has so far mainly adopted the former, with mild empirical success. In this paper, we close this gap by proposing a Group-OMP-based framework for sparse MKL. Unlike ℓ1-MKL, our approach decouples the sparsity regularizer (via a direct ℓ0 constraint) from the smoothness regularizer (via RKHS norms), which leads to better empirical performance and a simpler optimization procedure that only requires a black-box single-kernel solver. The algorithmic development and empirical studies are complemented by theoretical analyses in terms of Rademacher generalization bounds and sparse recovery conditions analogous to those for OMP [27] and Group-OMP [16].
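To make the decoupling concrete, one way such an objective could be written is sketched below. This is an illustrative formulation, not necessarily the exact problem stated in the paper: the squared loss, the symbols M (number of kernels), n (sample size), λ (smoothness weight), and s (sparsity budget) are all assumptions introduced here. The first line keeps the RKHS-norm penalty and the ℓ0-type constraint separate; the second shows, for comparison, a typical ℓ1-type MKL penalty in which a single term controls both smoothness and sparsity.

```latex
% Decoupled form (illustrative): smoothness via RKHS norms, sparsity via a count constraint
\min_{f_1 \in \mathcal{H}_1, \dots, f_M \in \mathcal{H}_M}
  \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{M} f_j(x_i) \Big)^2
  + \lambda \sum_{j=1}^{M} \| f_j \|_{\mathcal{H}_j}^2
\quad \text{s.t.} \quad
  \big| \{\, j : f_j \neq 0 \,\} \big| \le s

% Coupled form (illustrative): one block-norm penalty plays both roles
\min_{f_1 \in \mathcal{H}_1, \dots, f_M \in \mathcal{H}_M}
  \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{M} f_j(x_i) \Big)^2
  + \lambda \sum_{j=1}^{M} \| f_j \|_{\mathcal{H}_j}
```

In the decoupled form, once a set of at most s kernels is fixed, the remaining problem is an ordinary regularized fit over the sum of the selected RKHSs, which is consistent with the abstract's remark that only a black-box single-kernel solver is required.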

[1] Naoki Abe et al. Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction, 2009, NIPS.

[2] H. Zou et al. Regularization and variable selection via the elastic net, 2005.

[3] Gunnar Rätsch et al. Large Scale Multiple Kernel Learning, 2006, J. Mach. Learn. Res.

[4] Jason Weston et al. Learning Gene Functional Classifications from Multiple Data Types, 2002, J. Comput. Biol.

[5] M. Kloft et al. lp-Norm Multiple Kernel Learning, 2011.

[6] Vikas Sindhwani et al. Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference, 2010, NIPS.

[7] Peter L. Bartlett et al. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results, 2003, J. Mach. Learn. Res.

[8] Zenglin Xu et al. Simple and Efficient Multiple Kernel Learning by Group Lasso, 2010, ICML.

[9] Francis R. Bach et al. Consistency of the group Lasso and multiple kernel learning, 2007, J. Mach. Learn. Res.

[10] Michael I. Jordan et al. Multiple kernel learning, conic duality, and the SMO algorithm, 2004, ICML.

[11] J. Lafferty et al. Sparse additive models, 2007, arXiv:0711.4555.

[12] Alexander Zien et al. lp-Norm Multiple Kernel Learning, 2011, J. Mach. Learn. Res.

[13] M. Yuan et al. Dimension reduction and coefficient estimation in multivariate linear regression, 2007.

[14] Peter L. Bartlett et al. A Unifying View of Multiple Kernel Learning, 2010, ECML/PKDD.

[15] Nello Cristianini et al. A statistical framework for genomic data fusion, 2004, Bioinform.

[16] Cheng Soon Ong et al. Multiclass multiple kernel learning, 2007, ICML '07.

[17] William Stafford Noble et al. Kernel methods for predicting protein-protein interactions, 2005, ISMB.

[18] Tong Zhang et al. On the Consistency of Feature Selection using Greedy Least Squares Regression, 2009, J. Mach. Learn. Res.

[19] Joel A. Tropp et al. Greed is good: algorithmic results for sparse approximation, 2004, IEEE Transactions on Information Theory.

[20] Charles A. Micchelli et al. Learning the Kernel Function via Regularization, 2005, J. Mach. Learn. Res.

[21] Pascal Vincent et al. Kernel Matching Pursuit, 2002, Machine Learning.

[22] Francis R. Bach et al. High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning, 2009, ArXiv.

[23] V. Koltchinskii et al. Sparsity in Multiple Kernel Learning, 2010, arXiv:1211.2998.

[24] Sundeep Rangan et al. Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis, 2009, NIPS.

[25] Ryota Tomioka et al. Sparsity-accuracy trade-off in MKL, 2010, arXiv:1001.2615.

[26] Julien Mairal et al. Optimization with Sparsity-Inducing Penalties, 2011, Found. Trends Mach. Learn.

[27] N. Aronszajn. Theory of Reproducing Kernels, 1950.

[28] Martin J. Wainwright et al. Minimax-Optimal Rates For Sparse Additive Models Over Kernel Classes Via Convex Programming, 2010, J. Mach. Learn. Res.

[29] Nello Cristianini et al. Learning the Kernel Matrix with Semidefinite Programming, 2002, J. Mach. Learn. Res.

[30] Tong Zhang et al. Sparse Recovery With Orthogonal Matching Pursuit Under RIP, 2010, IEEE Transactions on Information Theory.

[31] Nello Cristianini et al. Kernel Methods for Pattern Analysis, 2004.

[32] Mehryar Mohri et al. Generalization Bounds for Learning Kernels, 2010, ICML.

[33] A. Atiya et al. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, 2005, IEEE Transactions on Neural Networks.