On the Equivalence between Kernel Quadrature Rules and Random Feature Expansions

We show that kernel-based quadrature rules for computing integrals can be seen as a special case of random feature expansions for positive definite kernels, for a particular decomposition that always exists for such kernels. We provide a theoretical analysis of the number of required samples for a given approximation error, leading to both upper and lower bounds that are based solely on the eigenvalues of the associated integral operator and match up to logarithmic terms. In particular, we show that the upper bound may be obtained from independent and identically distributed samples from a specific non-uniform distribution, while the lower bound is valid for any set of points. While our results are fairly general, applying them to kernel-based quadrature recovers known upper and lower bounds for the special case of Sobolev spaces. Moreover, our results extend to the more general problem of full function approximation (beyond simply computing an integral), with results in $L_2$- and $L_\infty$-norm that match known results for special cases. Applying our results to random features, we show an improvement in the number of random features needed to preserve generalization guarantees for learning with Lipschitz-continuous losses.
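To make the quadrature side of this connection concrete, below is a minimal numerical sketch of kernel (Bayesian) quadrature for a one-dimensional toy problem. The Gaussian kernel, the standard normal integration measure, the bandwidth, the node count, and the small ridge regularization are all illustrative assumptions, not the paper's specific construction; in particular, the nodes here are drawn i.i.d. from the integration measure itself rather than from the non-uniform distribution that yields the paper's upper bound.

```python
# Minimal sketch of kernel quadrature with a Gaussian kernel and a
# standard normal integration measure rho = N(0, 1).  All constants
# (gamma, n, lam) are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
gamma = 1.0     # kernel bandwidth (assumed)
n = 50          # number of quadrature nodes
lam = 1e-8      # small ridge term for numerical stability (assumed)

def kernel(a, b):
    """Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 gamma^2))."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * gamma ** 2))

def kernel_mean(y):
    """Closed-form kernel mean mu(y) = E_{x ~ N(0,1)}[k(x, y)]."""
    return np.sqrt(gamma ** 2 / (gamma ** 2 + 1)) * np.exp(-y ** 2 / (2 * (gamma ** 2 + 1)))

# Nodes drawn i.i.d. from the integration measure rho.
x = rng.standard_normal(n)

# Quadrature weights solve (K + lam * n * I) w = z, with z_i = mu(x_i);
# this is the usual regularized kernel-quadrature weight system.
K = kernel(x, x)
z = kernel_mean(x)
w = np.linalg.solve(K + lam * n * np.eye(n), z)

# Test integrand f(x) = exp(-x^2 / 2), whose integral against N(0, 1)
# is exactly 1 / sqrt(2).
f = lambda t: np.exp(-t ** 2 / 2)
estimate = w @ f(x)
monte_carlo = f(x).mean()
exact = 1.0 / np.sqrt(2.0)
print(f"kernel quadrature: {estimate:.6f}")
print(f"plain Monte Carlo: {monte_carlo:.6f}")
print(f"exact value:       {exact:.6f}")
```

The weights obtained from the regularized linear system approximately minimize the worst-case quadrature error over the unit ball of the reproducing kernel Hilbert space; in typical runs the kernel-quadrature estimate is far closer to the exact value than the plain Monte Carlo average over the same nodes, which is the qualitative gap that the sample-complexity bounds above quantify.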
