On Sampling Random Features From Empirical Leverage Scores: Implementation and Theoretical Guarantees

Random features provide a practical framework for large-scale kernel approximation and supervised learning. It has been shown that data-dependent sampling of random features using leverage scores can significantly reduce the number of features required to achieve optimal learning bounds. Leverage scores introduce an optimized distribution over features based on an infinite-dimensional integral operator (depending on the input distribution), which is impractical to sample from. Focusing on empirical leverage scores in this paper, we establish an out-of-sample performance bound that reveals a trade-off between the approximated kernel and the eigenvalue decay of another kernel, defined on the domain of random features by the data distribution. Our experiments verify that the empirical algorithm consistently outperforms vanilla Monte Carlo sampling, and that with a minor modification the method is competitive with supervised data-dependent kernel learning, without using the output (label) information.
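The abstract describes the procedure only at a high level, so the following is a minimal sketch, in Python/NumPy, of how leverage-score resampling of random Fourier features can be implemented: draw a vanilla Monte Carlo pool of features, compute empirical ridge leverage scores on the input data, and resample features proportionally to those scores. The Gaussian kernel, the pool size M0, the ridge parameter lam, and the importance reweighting below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def leverage_score_rff(X, M0=500, M=100, lam=1e-3, sigma=1.0, seed=None):
    """Hypothetical sketch: resample random Fourier features according to
    empirical (ridge) leverage scores instead of keeping a plain Monte Carlo pool."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # 1) Vanilla Monte Carlo pool for a Gaussian kernel (Rahimi-Recht style):
    #    frequencies drawn from N(0, sigma^{-2} I) with uniform random phases.
    W = rng.normal(scale=1.0 / sigma, size=(M0, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M0)
    Z = np.sqrt(2.0 / M0) * np.cos(X @ W.T + b)      # (n, M0); Z @ Z.T approximates the kernel matrix

    # 2) Empirical ridge leverage score of each pooled feature:
    #    q_m proportional to z_m^T (Z Z^T + n*lam*I_n)^{-1} z_m, computed via the
    #    equivalent M0 x M0 system (G + n*lam*I)^{-1} G (push-through identity).
    G = Z.T @ Z
    scores = np.diag(np.linalg.solve(G + n * lam * np.eye(M0), G))
    q = np.clip(scores, 0.0, None)
    q = q / q.sum()

    # 3) Resample M features proportionally to q; importance weights keep the
    #    implicit kernel estimate consistent with the original pool.
    idx = rng.choice(M0, size=M, replace=True, p=q)
    w = np.sqrt(1.0 / (M * M0 * q[idx]))             # per-feature reweighting

    def feature_map(X_new):
        return np.sqrt(2.0) * w * np.cos(X_new @ W[idx].T + b[idx])

    return feature_map
```

Fitting an ordinary linear model (e.g., ridge regression) on feature_map(X) then yields a label-free, data-dependent approximation in the spirit of the empirical algorithm that the abstract compares against vanilla Monte Carlo sampling; the scores above use only the inputs X, never the labels.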
