On the Sample Complexity of Random Fourier Features for Online Learning

We study the sample complexity of random Fourier features for online kernel learning, that is, the number of random Fourier features required to achieve good generalization performance. We show that when the loss function is strongly convex and smooth, online kernel learning with random Fourier features can achieve an O(log T / T) bound on the excess risk with only O(1/λ²) random Fourier features, where T is the number of training examples and λ is the modulus of strong convexity. This is a significant improvement over the existing result for batch kernel learning, which requires O(T) random Fourier features to achieve a generalization bound of O(1/√T). Our empirical study verifies that online kernel learning with a limited number of random Fourier features achieves generalization performance similar to that of online learning using the full kernel matrix. We also present an enhanced online learning algorithm with random Fourier features that improves classification performance through multiple passes over the training examples and a partial average of the iterates.
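
Below is a minimal sketch of the kind of learner the abstract describes, not the paper's exact algorithm: it pairs Rahimi and Recht's random Fourier feature map for the Gaussian kernel with online gradient descent on a λ-strongly convex regularized logistic loss, and returns a partial (suffix) average of the iterates. All function names and default parameter values are illustrative.

```python
import numpy as np

def make_rff(dim, num_features, sigma, rng):
    """Sample a random Fourier feature map z(.) such that z(x)^T z(y)
    approximates the Gaussian kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    W = rng.normal(scale=1.0 / sigma, size=(num_features, dim))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return lambda x: np.sqrt(2.0 / num_features) * np.cos(W @ x + b)

def online_rff(X, y, num_features=200, sigma=1.0, lam=0.01, seed=0):
    """One pass of online gradient descent in the random feature space,
    minimizing the lam-strongly-convex regularized logistic loss
    log(1 + exp(-y * w^T z(x))) + (lam / 2) * ||w||^2."""
    T, dim = X.shape
    z = make_rff(dim, num_features, sigma, np.random.default_rng(seed))
    w = np.zeros(num_features)
    w_avg = np.zeros(num_features)
    for t in range(1, T + 1):          # labels y[t] are in {-1, +1}
        phi = z(X[t - 1])
        margin = y[t - 1] * (w @ phi)
        # Gradient of the regularized logistic loss at the current example.
        grad = -y[t - 1] * phi / (1.0 + np.exp(margin)) + lam * w
        w -= grad / (lam * t)          # standard 1/(lam * t) step size
        if t > T // 2:                 # partial average over the last half
            w_avg += w / (T - T // 2)
    return z, w_avg
```

A new example x would then be classified as np.sign(w_avg @ z(x)); averaging only the later iterates is one common way to realize the "partial average" mentioned above for strongly convex losses.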
