lp-Norm Multiple Kernel Learning

Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this l1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures that generalize well, we extend MKL to arbitrary norms. We provide new insights into the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, that is, lp-norms with p ≥ 1. This interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on several data sets. A theoretical analysis and an experiment on controlled artificial data shed light on the appropriateness of sparse, non-sparse, and l∞-norm MKL in various scenarios. Importantly, empirical applications of lp-norm MKL to three real-world problems from computational biology show that non-sparse MKL achieves accuracies that surpass the state of the art. Data sets, source code to reproduce the experiments, implementations of the algorithms, and further information are available at http://doc.ml.tu-berlin.de/nonsparse_mkl/.
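
As a rough illustration of the objects the abstract refers to, the sketch below forms a linear combination of kernel matrices whose non-negative weights are scaled to unit lp-norm. It is not the paper's optimization procedure: the weights here are fixed by hand rather than learned, and the names `lp_normalize`, `combine_kernels`, and `rbf_kernel` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def lp_normalize(theta, p):
    """Scale non-negative kernel weights so that their lp-norm equals 1 (p >= 1)."""
    theta = np.asarray(theta, dtype=float)
    if p < 1:
        raise ValueError("p must be >= 1")
    if np.any(theta < 0):
        raise ValueError("kernel weights must be non-negative")
    # p = inf corresponds to the l-infinity norm (largest weight).
    norm = np.max(theta) if np.isinf(p) else np.sum(theta ** p) ** (1.0 / p)
    if norm == 0:
        raise ValueError("at least one weight must be positive")
    return theta / norm


def combine_kernels(kernels, theta):
    """Return the weighted sum K = sum_m theta_m * K_m of kernel matrices."""
    kernels = np.asarray(kernels, dtype=float)  # shape (M, n, n)
    theta = np.asarray(theta, dtype=float)      # shape (M,)
    return np.tensordot(theta, kernels, axes=1)


def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix, used here only as a second example kernel."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))
```

With p = 1 the constraint favors sparse weight vectors, whereas larger p spreads weight more evenly across kernels; for example, `lp_normalize([1.0, 3.0], p=2)` rescales the two weights to unit Euclidean norm before they are passed to `combine_kernels`.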
