Data-dependent compression of random features for large-scale kernel approximation

Kernel methods offer the flexibility to learn complex relationships in modern, large data sets while enjoying strong theoretical guarantees on quality. Unfortunately, these methods typically require cubic running time in the data set size, a prohibitive cost in the large-data setting. Random feature maps (RFMs) and the Nyström method both consider low-rank approximations to the kernel matrix as a potential solution. But, in order to achieve desirable theoretical guarantees, the former may require a prohibitively large number of features J+, and the latter may be prohibitively expensive for high-dimensional problems. We propose to combine the simplicity and generality of RFMs with a data-dependent feature selection scheme to achieve the desirable theoretical approximation properties of Nyström with just O(log J+) features. Our key insight is to begin with a large set of random features, then reduce them to a small number of weighted features in a data-dependent, computationally efficient way, while preserving the statistical guarantees of using the original large set of features. We demonstrate the efficacy of our method with theory and experiments, including on a data set with over 50 million observations. In particular, we show that our method achieves small kernel matrix approximation error and better test set accuracy with provably fewer random features than state-of-the-art methods.
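The pipeline sketched in the abstract, generate a large pool of random features and then compress them to a small, weighted, data-dependent subset, can be illustrated with a minimal example. The snippet below uses standard random Fourier features for an RBF kernel and a simple greedy least-squares reweighting fit on a subsample of the kernel matrix; the function names, the greedy selection rule, and parameters such as `n_sub` are illustrative assumptions standing in for the paper's actual compression algorithm, not a reproduction of it.

```python
import numpy as np


def random_fourier_features(X, J, lengthscale=1.0, seed=0):
    """Map X (n x d) to J random Fourier features whose inner products
    approximate an RBF kernel with the given lengthscale."""
    rng = np.random.default_rng(seed)
    _, d = X.shape
    W = rng.normal(scale=1.0 / lengthscale, size=(d, J))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=J)             # random phases
    return np.sqrt(2.0 / J) * np.cos(X @ W + b)            # n x J feature matrix


def compress_features(Z, J_small, n_sub=50, seed=0):
    """Greedily pick J_small of the J feature columns of Z and reweight them so
    the weighted features reproduce the full-feature kernel on a data subsample.
    Illustrative stand-in for data-dependent compression, not the paper's
    exact algorithm."""
    rng = np.random.default_rng(seed)
    n, J = Z.shape
    idx = rng.choice(n, size=min(n_sub, n), replace=False)
    Zs = Z[idx]                                # subsample used to fit weights
    y = (Zs @ Zs.T).ravel()                    # target kernel from all J features
    # Column j contributes the rank-one matrix Zs[:, j] Zs[:, j]^T to the kernel.
    A = np.stack([np.outer(Zs[:, j], Zs[:, j]).ravel() for j in range(J)], axis=1)
    selected, residual, w = [], y.copy(), np.zeros(0)
    for _ in range(J_small):
        scores = A.T @ residual                # greedy: pick the best-aligned feature
        scores[selected] = -np.inf             # never pick the same feature twice
        selected.append(int(np.argmax(scores)))
        w, *_ = np.linalg.lstsq(A[:, selected], y, rcond=None)
        residual = y - A[:, selected] @ w
    return np.array(selected), np.clip(w, 0.0, None)  # indices, non-negative weights


# Example: compress 2000 random features to 50 weighted features.
X = np.random.default_rng(1).normal(size=(500, 10))
Z = random_fourier_features(X, J=2000)
sel, weights = compress_features(Z, J_small=50)
Z_small = Z[:, sel] * np.sqrt(weights)         # n x 50; Z_small @ Z_small.T ≈ Z @ Z.T
```

The point of the sketch is the structure of the method: the expensive randomness (drawing many features) is data-independent and cheap per feature, while the data-dependent step only has to choose and weight a small subset so that the compressed feature map preserves the kernel matrix the large set would have induced.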
