Generalization Properties of Learning with Random Features

We study the generalization properties of ridge regression with random features in the statistical learning framework. We show for the first time that $O(1/\sqrt{n})$ learning bounds can be achieved with only $O(\sqrt{n}\log n)$ random features rather than $O(n)$, as suggested by previous results. Further, we prove faster learning rates and show that they might require more random features, unless these are sampled according to a possibly problem-dependent distribution. Our results shed light on the statistical-computational trade-offs in large-scale kernelized learning, showing the potential effectiveness of random features in reducing the computational complexity while preserving optimal generalization properties.
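
For concreteness, below is a minimal NumPy sketch (ours, not code from the paper) of the estimator studied here: ridge regression on $M$ random Fourier features approximating a Gaussian kernel. The function names and the parameters `sigma` (kernel bandwidth) and `lam` (regularization) are illustrative assumptions.

```python
import numpy as np

def random_fourier_features(X, W, b):
    """Rahimi-Recht feature map: phi(x) = sqrt(2/M) * cos(W x + b)."""
    M = W.shape[0]
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)

def rff_ridge_fit(X, y, M, sigma, lam, seed=0):
    """Ridge regression on M random Fourier features for the Gaussian
    kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(M, d))  # frequencies ~ N(0, sigma^{-2} I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)       # phases ~ U[0, 2*pi]
    Phi = random_fourier_features(X, W, b)          # n x M design matrix
    # Solve (Phi^T Phi + lam * n * I) alpha = Phi^T y:
    # O(n M^2 + M^3) time and O(n M) memory, versus O(n^3) / O(n^2) for exact KRR.
    alpha = np.linalg.solve(Phi.T @ Phi + lam * n * np.eye(M), Phi.T @ y)
    return W, b, alpha

def rff_ridge_predict(Xtest, W, b, alpha):
    return random_fourier_features(Xtest, W, b) @ alpha
```

Per the paper's main result, taking $M$ of order $\sqrt{n}\log n$ already suffices for $O(1/\sqrt{n})$ learning bounds, so the solve above costs roughly $O(n^2\log^2 n)$ time rather than the $O(n^3)$ of exact kernel ridge regression.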
