Regularization Strategies and Empirical Bayesian Learning for MKL

Multiple kernel learning (MKL), structured sparsity, and multi-task learning have recently received considerable attention. In this paper, we show how different MKL algorithms can be understood as applications of either regularization on the kernel weights or block-norm-based regularization, which is more common in structured sparsity and multi-task learning. We show that these two regularization strategies can be systematically mapped to each other through a concave conjugate operation. When the kernel-weight-based regularizer is separable into components, we can naturally consider a generative probabilistic model behind MKL. Based on this model, we propose learning algorithms for the kernel weights through the maximization of marginal likelihood. We show through numerical experiments that $\ell_2$-norm MKL and Elastic-net MKL achieve accuracy comparable to that of uniform kernel combination. Although uniform kernel combination might be preferable for its simplicity, $\ell_2$-norm MKL and Elastic-net MKL can learn the usefulness of the information sources represented as kernels. In particular, Elastic-net MKL achieves sparsity in the kernel weights.
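The correspondence between the two regularization strategies can be illustrated with a minimal sketch. It assumes the simplest case, the block 1-norm, and uses the standard variational identity $\|f_m\| = \min_{d_m > 0} \frac{1}{2}\left(\|f_m\|^2 / d_m + d_m\right)$, whose minimizer is $d_m = \|f_m\|$. The function names `combine_kernels` and `weight_penalty` are hypothetical and chosen for illustration only; this is not the paper's algorithm, just one instance of a kernel-weight penalty whose minimum over the weight recovers a block norm.

```python
import numpy as np


def combine_kernels(kernels, weights):
    """Return the weighted kernel combination sum_m d_m * K_m.

    kernels: sequence of M Gram matrices, each of shape (n, n).
    weights: sequence of M non-negative kernel weights d_m.
    """
    kernels = np.asarray(kernels, dtype=float)   # shape (M, n, n)
    weights = np.asarray(weights, dtype=float)   # shape (M,)
    # Contract the weight axis with the first axis of the kernel stack.
    return np.tensordot(weights, kernels, axes=1)


def weight_penalty(block_norm, d):
    """Kernel-weight form of the penalty: 0.5 * (||f_m||^2 / d_m + d_m).

    Assumes d > 0. Minimizing over d gives the block norm ||f_m||,
    attained at d = ||f_m|| (illustrative block 1-norm case only).
    """
    return 0.5 * (block_norm ** 2 / d + d)
```

With uniform weights, `combine_kernels` reduces to the uniform kernel combination mentioned above, up to scaling. Evaluating `weight_penalty(2.0, d)` at `d = 2.0` returns the block norm itself, while any other positive `d` gives a larger value, which is the sense in which the weight-based penalty upper-bounds the block-norm penalty.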
