Optimal Rates for Random Fourier Features

Kernel methods are among the most powerful tools in machine learning for problems expressed in terms of function values and derivatives, owing to their ability to represent and model complex relations. Although versatile, these methods are computationally intensive and scale poorly to large data sets because they require operations on Gram matrices. To mitigate this limitation, randomized constructions have recently been proposed in the literature that allow the application of fast linear algorithms. Random Fourier features (RFFs) are among the most popular and widely applied of these constructions: they provide an easily computable, low-dimensional feature representation for shift-invariant kernels. Despite the popularity of RFFs, very little is understood theoretically about their approximation quality. In this paper, we provide a detailed finite-sample theoretical analysis of the approximation quality of RFFs by (i) establishing performance guarantees in uniform norm that are optimal in terms of the RFF dimension and the growing set size, and (ii) presenting guarantees in L^r (1 ≤ r < ∞) norms. We also propose an RFF approximation to derivatives of a kernel, together with a theoretical study of its approximation quality.
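To make the construction concrete, the following is a minimal sketch of an RFF approximation, not the paper's own code. It assumes a Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)), whose spectral measure is the normal distribution N(0, sigma^-2 I), and it measures the error only on a finite sample rather than over a compact set. The function names `rff_features` and `gaussian_kernel`, the bandwidth, and the sample sizes are illustrative choices.

```python
import numpy as np


def rff_features(X, num_features, sigma, rng):
    """Map X of shape (n, d) to random Fourier features of shape (n, 2 * num_features)."""
    d = X.shape[1]
    # Assumption: Gaussian kernel, so frequencies are drawn from N(0, sigma^-2 I).
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    proj = X @ W
    # Paired cos/sin features; their inner product averages cos(w^T (x - y)).
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(num_features)


def gaussian_kernel(X, Y, sigma):
    """Exact Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    # Guard against tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 3))
sigma = 1.0
K_exact = gaussian_kernel(X, X, sigma)

# Largest entrywise error on the sample, for increasing RFF dimension m.
errors = {}
for m in (10, 100, 1000, 10000):
    Z = rff_features(X, m, sigma, rng)
    errors[m] = float(np.max(np.abs(Z @ Z.T - K_exact)))

for m, err in errors.items():
    print(f"m = {m:5d}  max abs error = {err:.4f}")
```

The printed errors should shrink as m grows. This only illustrates the dependence on the RFF dimension that the uniform-norm guarantees quantify; it is not a verification of the rates.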
