Data-driven Random Fourier Features using Stein Effect

Large-scale kernel approximation is an important problem in machine learning research. Approaches using random Fourier features have become increasingly popular [Rahimi and Recht, 2007], where kernel approximation is treated as empirical mean estimation via Monte Carlo (MC) or Quasi-Monte Carlo (QMC) integration [Yang et al., 2014]. A limitation of the current approaches is that all the features receive an equal weight summing to 1. In this paper, we propose a novel shrinkage estimator from "Stein effect", which provides a data-driven weighting strategy for random features and enjoys theoretical justifications in terms of lowering the empirical risk. We further present an efficient randomized algorithm for large-scale applications of the proposed method. Our empirical results on six benchmark data sets demonstrate the advantageous performance of this approach over representative baselines in both kernel approximation and supervised learning tasks.

[1]  Zhihua Zhang,et al.  Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling , 2013, J. Mach. Learn. Res..

[2]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[3]  Benjamin Recht,et al.  Random Features for Large-Scale Kernel Machines , 2007, NIPS.

[4]  John C. Duchi,et al.  Learning Kernels with Random Features , 2016, NIPS.

[5]  David Duvenaud,et al.  Optimally-Weighted Herding is Bayesian Quadrature , 2012, UAI.

[6]  Francis R. Bach,et al.  On the Equivalence between Quadrature Rules and Random Features , 2015, ArXiv.

[7]  W. Rudin,et al.  Fourier Analysis on Groups. , 1965 .

[8]  Alexander J. Smola,et al.  Fastfood - Computing Hilbert Space Expansions in loglinear time , 2013, ICML.

[9]  David P. Woodruff Sketching as a Tool for Numerical Linear Algebra , 2014, Found. Trends Theor. Comput. Sci..

[10]  Matthias W. Seeger,et al.  Using the Nyström Method to Speed Up Kernel Machines , 2000, NIPS.

[11]  Andreas Christmann,et al.  Support vector machines , 2008, Data Mining and Knowledge Discovery Handbook.

[12]  Alexander J. Smola,et al.  Online learning with kernels , 2001, IEEE Transactions on Signal Processing.

[13]  Tara N. Sainath,et al.  Kernel methods match Deep Neural Networks on TIMIT , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Rong Jin,et al.  Efficient Kernel Clustering Using Random Fourier Features , 2012, 2012 IEEE 12th International Conference on Data Mining.

[15]  Frances Y. Kuo,et al.  High-dimensional integration: The quasi-Monte Carlo way*† , 2013, Acta Numerica.

[16]  Rong Jin,et al.  Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison , 2012, NIPS.

[17]  Petros Drineas,et al.  On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning , 2005, J. Mach. Learn. Res..

[18]  Bernhard Schölkopf,et al.  Kernel Mean Estimation and Stein Effect , 2013, ICML.

[19]  Chun-Liang Li,et al.  Utilize Old Coordinates: Faster Doubly Stochastic Gradients for Kernel Methods , 2016, UAI.

[20]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[21]  David P. Woodruff,et al.  Faster Kernel Ridge Regression Using Sketching and Preconditioning , 2016, SIAM J. Matrix Anal. Appl..

[22]  Shusen Wang,et al.  Sketched Ridge Regression: Optimization Perspective, Statistical Perspective, and Model Averaging , 2017, ICML.

[23]  Vikas Sindhwani,et al.  Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels , 2014, J. Mach. Learn. Res..

[24]  AI Koan,et al.  Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning , 2008, NIPS.

[25]  Le Song,et al.  Scalable Kernel Methods via Doubly Stochastic Gradients , 2014, NIPS.

[26]  Lorenzo Rosasco,et al.  Generalization Properties of Learning with Random Features , 2016, NIPS.

[27]  David P. Woodruff,et al.  Sketching Structured Matrices for Faster Nonlinear Regression , 2013, NIPS.

[28]  Carl E. Rasmussen,et al.  Bayesian Monte Carlo , 2002, NIPS.