A Randomized Mirror Descent Algorithm for Large Scale Multiple Kernel Learning

We consider the problem of simultaneously learning a linear combination of a very large number of kernels and a good predictor based on the combined kernel. When the number of kernels $d$ to be combined is very large, multiple kernel learning methods whose computational cost scales linearly in $d$ are intractable. We propose a randomized version of the mirror descent algorithm to overcome this issue, under the objective of minimizing the group $p$-norm penalized empirical risk. The key to achieving the required exponential speed-up is the computationally efficient construction of low-variance estimates of the gradient. We propose importance-sampling-based estimates and find that the ideal distribution samples a coordinate with probability proportional to the magnitude of the corresponding gradient. We show the surprising result that, when learning the coefficients of a polynomial kernel, the combinatorial structure of the base kernels allows sampling from this distribution in $O(\log(d))$ time, so that the total computational cost of reaching an $\epsilon$-optimal solution is $O(\log(d)/\epsilon^2)$; this lets the method scale to very large values of $d$. Experiments with simulated and real data confirm that the new algorithm is computationally more efficient than its state-of-the-art alternatives.
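The paper's $O(\log(d))$ sampler relies on the specific combinatorial structure of polynomial-kernel coefficients; as a generic illustration of how drawing a coordinate with probability proportional to a per-coordinate weight (here, a gradient magnitude) can be done in logarithmic time, the following is a minimal sum-tree sketch in Python. The names `SumTree` and `grad_mags` are hypothetical illustrations, not from the paper, and this is not the authors' construction.

```python
import math
import random


class SumTree:
    """Complete binary sum tree over d nonnegative weights.

    sample() draws index i with probability w[i] / sum(w), and
    update() changes one weight; both cost O(log d).
    """

    def __init__(self, weights):
        self.d = len(weights)
        # Round the leaf count up to a power of two (extra leaves get weight 0).
        self.size = 1
        while self.size < self.d:
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = w
        # Each internal node stores the sum of its two children.
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def update(self, i, w):
        """Set weight i to w and repair the sums on the root path."""
        node = self.size + i
        self.tree[node] = w
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self):
        """Draw an index with probability proportional to its weight."""
        u = random.random() * self.tree[1]  # uniform in [0, total)
        node = 1
        while node < self.size:  # descend from the root to a leaf
            left = 2 * node
            if u < self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        return node - self.size


# Illustrative use with made-up gradient magnitudes for d = 8 coordinates.
grad_mags = [abs(math.sin(i + 1)) for i in range(8)]
tree = SumTree(grad_mags)
i = tree.sample()
p_i = grad_mags[i] / tree.tree[1]
# An importance-weighted estimate puts grad[i] / p_i at coordinate i and
# zero elsewhere; it is unbiased for the full gradient in expectation.
```

After each mirror descent step, only the weights of the touched coordinates change, so the estimator stays cheap: each `update` and each `sample` is $O(\log(d))$, which is the source of the exponential speed-up over scanning all $d$ coordinates.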
