Large-scale Online Kernel Learning with Random Feature Reparameterization

A typical online kernel learning method faces two fundamental issues: the complexity of dealing with a huge number of observed data points (a.k.a. the curse of kernelization) and the difficulty of learning kernel parameters, which are often assumed to be fixed. Random Fourier features are a recent and effective approach that addresses the former by approximating a shift-invariant kernel via Bochner's theorem, allowing the model to be maintained directly in a random feature space of fixed dimension, so that the model size remains constant with respect to the data size. In this paper, we introduce the reparameterized random feature (RRF), a random feature framework for large-scale online kernel learning that addresses both of the aforementioned challenges. Our initial intuition comes from the so-called reparameterization trick [Kingma and Welling, 2014]: we lift the source of randomness of the Fourier components to another space that can be sampled independently of the kernel parameters, so that a stochastic gradient with respect to the kernel parameters can be derived analytically. We develop a well-founded underlying theory for our method, including a general way to reparameterize the kernel and a new, tighter error bound on the approximation quality. This view further enables a direct application of stochastic gradient descent to update our model in an online learning setting. We conduct extensive experiments on several large-scale datasets and demonstrate that our method achieves state-of-the-art performance in both learning efficacy and efficiency.
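
To make the reparameterization concrete, the sketch below (ours, not the paper's implementation) instantiates the idea for the Gaussian RBF kernel with a hinge loss: the random Fourier frequencies are written as a fixed standard-normal sample divided by the kernel width sigma, so the source of randomness no longer depends on sigma and both the output weights and sigma can be updated by stochastic gradient descent as examples arrive. The function name, step sizes, and the choice to optimize log sigma are illustrative assumptions.

import numpy as np

def rrf_online_sketch(stream, d, D=256, lr_w=0.1, lr_sigma=0.01, seed=0):
    """Online binary classification with reparameterized random features
    for the RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)).

    The Fourier frequencies are written as W = E / sigma with E ~ N(0, I)
    drawn once, so the randomness is independent of the kernel width sigma
    and the gradient w.r.t. sigma is available in closed form.
    """
    rng = np.random.default_rng(seed)
    E = rng.standard_normal((D, d))            # fixed source of randomness
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phases
    w = np.zeros(D)                            # linear model in feature space
    log_sigma = 0.0                            # optimize sigma on the log scale

    for x, y in stream:                        # x: (d,) array, y in {-1, +1}
        sigma = np.exp(log_sigma)
        proj = E @ x
        u = proj / sigma + b
        z = np.sqrt(2.0 / D) * np.cos(u)       # random Fourier feature map
        margin = y * (w @ z)

        if margin < 1.0:                       # hinge loss is active
            grad_w = -y * z
            # dz/dsigma = sqrt(2/D) * sin(u) * proj / sigma^2 (chain rule on cos)
            dz_dsigma = np.sqrt(2.0 / D) * np.sin(u) * proj / sigma**2
            grad_log_sigma = -y * (w @ dz_dsigma) * sigma  # d sigma / d log sigma = sigma
            w -= lr_w * grad_w
            log_sigma -= lr_sigma * grad_log_sigma

    return w, np.exp(log_sigma)

As a usage example, the stream could be supplied as zip(X, y) over a dataset read in order; other shift-invariant kernels would follow the same pattern with a different reparameterization of the frequency distribution.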

[1] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes, 2014, ICLR.

[2] Slobodan Vucetic et al. Twin Vector Machines for Online Learning on a Budget, 2009, SDM.

[3] Neil D. Lawrence et al. Gaussian Processes for Big Data, 2013, UAI.

[4] Alexander J. Smola et al. Online learning with kernels, 2001, IEEE Transactions on Signal Processing.

[5] Yoram Singer et al. The Forgetron: A Kernel-Based Perceptron on a Fixed Budget, 2005, NIPS.

[6] Trung Le et al. Nonparametric Budgeted Stochastic Gradient Descent, 2016, AISTATS.

[7] Cristian Sminchisescu et al. Fourier Kernel Learning, 2012, ECCV.

[8] Sayan Mukherjee et al. Choosing Multiple Parameters for Support Vector Machines, 2002, Machine Learning.

[9] Andrew Gordon Wilson et al. Gaussian Process Kernels for Pattern Discovery and Extrapolation, 2013, ICML.

[10] Steven C. H. Hoi et al. Large Scale Online Kernel Learning, 2016, J. Mach. Learn. Res.

[11] Steven C. H. Hoi et al. Fast Bounded Online Gradient Descent Algorithms for Scalable Kernel-Based Online Learning, 2012, ICML.

[12] Barnabás Póczos et al. Bayesian Nonparametric Kernel-Learning, 2015, AISTATS.

[13] Claudio Gentile et al. Tracking the best hyperplane with a simple budget Perceptron, 2006, Machine Learning.

[14] Alexander J. Smola et al. Fastfood: Approximate Kernel Expansions in Loglinear Time, 2014, arXiv.

[15] Koby Crammer et al. Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale SVM training, 2012, J. Mach. Learn. Res.

[16] Trung Le et al. Multiple Kernel Learning with Data Augmentation, 2016, ACML.

[17] Ali Rahimi and Benjamin Recht. Random Features for Large-Scale Kernel Machines, 2007, NIPS.

[18] Slobodan Vucetic et al. Online Passive-Aggressive Algorithms on a Budget, 2010, AISTATS.

[19] Koby Crammer et al. On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, 2002, J. Mach. Learn. Res.

[20] Barbara Caputo et al. Bounded Kernel-Based Online Learning, 2009, J. Mach. Learn. Res.

[21] Le Song et al. A la Carte - Learning Fast Kernels, 2014, AISTATS.

[22] Trung Le et al. Dual Space Gradient Descent for Online Learning, 2016, NIPS.

[23] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent, 2003, ICML.
