Non-Sparse Regularization and Efficient Training with Multiple Kernels

Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability and scalability. Unfortunately, this ℓ1-norm MKL is rarely observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary norms. We provide new insights into the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary norms, such as ℓp-norms with p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster than the commonly used wrapper approaches. A controlled experiment on artificial data sheds light on the appropriateness of sparse, non-sparse, and ℓ∞-norm MKL in various scenarios. Applying ℓp-norm MKL to three hard real-world problems from computational biology shows that non-sparse MKL achieves accuracies that go beyond the state of the art. We conclude that our improvements finally make MKL fit for deployment in practical applications: MKL now has a good chance of improving accuracy over a plain sum kernel at an affordable computational cost.
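
To make the notion of a norm-constrained kernel mixture concrete, the following is a minimal sketch, not the optimization method of the paper. It only builds a combined kernel K = sum_m theta_m K_m from non-negative weights rescaled to unit ℓp-norm, and contrasts a sparse weighting with a uniform one. The helper names (rbf_kernel, lp_normalize, combine_kernels), the choice of Gaussian base kernels, and the toy data are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(X, gamma):
    """Gaussian RBF kernel matrix for the rows of X (illustrative base kernel)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))


def lp_normalize(theta, p):
    """Rescale non-negative kernel weights so that ||theta||_p = 1."""
    theta = np.asarray(theta, dtype=float)
    if np.any(theta < 0):
        raise ValueError("kernel weights must be non-negative")
    norm = np.max(theta) if np.isinf(p) else np.sum(theta ** p) ** (1.0 / p)
    if norm == 0:
        raise ValueError("at least one weight must be positive")
    return theta / norm


def combine_kernels(kernels, theta):
    """Weighted sum K = sum_m theta_m * K_m of equally sized kernel matrices."""
    return np.tensordot(theta, np.stack(kernels), axes=1)


# Toy data and three hypothetical base kernels with different bandwidths.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
kernels = [rbf_kernel(X, g) for g in (0.1, 1.0, 10.0)]

# A sparse mixture (all weight on one kernel, unit l1-norm) ...
theta_sparse = lp_normalize([1.0, 0.0, 0.0], p=1)
# ... versus a non-sparse mixture (uniform weights, unit l2-norm).
theta_uniform = lp_normalize([1.0, 1.0, 1.0], p=2)

K_sparse = combine_kernels(kernels, theta_sparse)
K_uniform = combine_kernels(kernels, theta_uniform)
```

With uniform weights the combined kernel is a rescaled plain sum kernel, which is the baseline the abstract compares against. Either matrix could be passed to any kernel method that accepts a precomputed kernel; learning the weights themselves is the part addressed by the interleaved strategies and is not shown here.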
