A Randomized Mirror Descent Algorithm for Large Scale Multiple Kernel Learning

We consider the problem of simultaneously learning a linear combination of a very large number of kernels and a good predictor based on the combined kernel. When the number of kernels $d$ to be combined is very large, multiple kernel learning methods whose computational cost scales linearly in $d$ are intractable. We propose a randomized version of the mirror descent algorithm to overcome this issue, under the objective of minimizing the group $p$-norm penalized empirical risk. The key to achieving the required exponential speed-up is the computationally efficient construction of low-variance estimates of the gradient. We propose importance-sampling-based estimates and find that the ideal distribution samples a coordinate with probability proportional to the magnitude of the corresponding gradient. We show the surprising result that, when learning the coefficients of a polynomial kernel, the combinatorial structure of the base kernels allows sampling from this distribution in $O(\log(d))$ time, so that the total computational cost of reaching an $\epsilon$-optimal solution is $O(\log(d)/\epsilon^2)$; this lets the method scale to very large values of $d$. Experiments with simulated and real data confirm that the new algorithm is computationally more efficient than its state-of-the-art alternatives.
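The paper's $O(\log(d))$ sampler relies on the specific combinatorial structure of polynomial-kernel coefficients; as a generic illustration of how drawing a coordinate with probability proportional to a per-coordinate weight (here, a gradient magnitude) can be done in logarithmic time, the following is a minimal sum-tree sketch in Python. The names `SumTree` and `grad_mags` are hypothetical illustrations, not from the paper, and this is not the authors' construction.

```python
import math
import random


class SumTree:
    """Complete binary sum tree over d nonnegative weights.

    sample() draws index i with probability w[i] / sum(w), and
    update() changes one weight; both cost O(log d).
    """

    def __init__(self, weights):
        self.d = len(weights)
        # Round the leaf count up to a power of two (extra leaves get weight 0).
        self.size = 1
        while self.size < self.d:
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = w
        # Each internal node stores the sum of its two children.
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def update(self, i, w):
        """Set weight i to w and repair the sums on the root path."""
        node = self.size + i
        self.tree[node] = w
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self):
        """Draw an index with probability proportional to its weight."""
        u = random.random() * self.tree[1]  # uniform in [0, total)
        node = 1
        while node < self.size:  # descend from the root to a leaf
            left = 2 * node
            if u < self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        return node - self.size


# Illustrative use with made-up gradient magnitudes for d = 8 coordinates.
grad_mags = [abs(math.sin(i + 1)) for i in range(8)]
tree = SumTree(grad_mags)
i = tree.sample()
p_i = grad_mags[i] / tree.tree[1]
# An importance-weighted estimate puts grad[i] / p_i at coordinate i and
# zero elsewhere; it is unbiased for the full gradient in expectation.
```

After each mirror descent step, only the weights of the touched coordinates change, so the estimator stays cheap: each `update` and each `sample` is $O(\log(d))$, which is the source of the exponential speed-up over scanning all $d$ coordinates.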
