The Geometry of Random Features

We present an in-depth examination of the effectiveness of radial basis function (RBF) kernel estimators, beyond the Gaussian kernel, based on orthogonal random feature maps. We show that orthogonal estimators outperform state-of-the-art mechanisms that use i.i.d. sampling, under weak conditions on the tails of the associated Fourier distributions. We prove that in the high-dimensional regime the superiority of the orthogonal transform can be accurately measured by a property we define called the charm of the kernel, and that orthogonal random features provide optimal (in terms of mean squared error) kernel estimators. We provide the first theoretical results explaining why orthogonal random features outperform unstructured ones on downstream tasks such as kernel ridge regression, by showing that orthogonal random features give kernel algorithms better spectral properties than the previous state of the art. More generally, our results enable practitioners to estimate the benefits of applying orthogonal transforms.
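
Below is a minimal sketch, not from the paper, of the comparison it studies: approximating a Gaussian (RBF) kernel with i.i.d. random Fourier features versus orthogonal random features built from a QR decomposition with chi-distributed row norms (the standard construction of Yu et al., 2016). The dimensions, feature counts, and helper names are illustrative assumptions.

```python
# Sketch: i.i.d. vs. orthogonal random Fourier features for the Gaussian kernel.
# Illustrative only; dimensions and names are assumptions, not the paper's setup.
import numpy as np

def iid_gaussian_matrix(m, d, rng):
    """m x d matrix with i.i.d. N(0, 1) entries (unstructured baseline)."""
    return rng.standard_normal((m, d))

def orthogonal_gaussian_matrix(m, d, rng):
    """m x d matrix whose rows are orthogonal within each d x d block,
    with row norms rescaled to chi_d so they match a Gaussian vector."""
    blocks, rows_left = [], m
    while rows_left > 0:
        G = rng.standard_normal((d, d))
        Q, _ = np.linalg.qr(G)               # orthonormal rows of a square orthogonal matrix
        blocks.append(Q[:min(rows_left, d)])
        rows_left -= d
    W = np.vstack(blocks)
    norms = np.sqrt(rng.chisquare(d, size=m))  # chi_d-distributed row norms
    return W * norms[:, None]

def rff(X, W):
    """[cos, sin] random Fourier feature map; unbiased for the Gaussian kernel."""
    m = W.shape[0]
    Z = X @ W.T
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)

def kernel_mse(X, W, sigma=1.0):
    """Mean squared error of the estimated kernel matrix vs. the exact Gaussian kernel."""
    Phi = rff(X / sigma, W)
    K_hat = Phi @ Phi.T
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma**2))
    return np.mean((K_hat - K) ** 2)

rng = np.random.default_rng(0)
d, m, n = 32, 64, 200                        # data dim, number of features, sample size
X = rng.standard_normal((n, d))
print("i.i.d.     MSE:", kernel_mse(X, iid_gaussian_matrix(m, d, rng)))
print("orthogonal MSE:", kernel_mse(X, orthogonal_gaussian_matrix(m, d, rng)))
```

Under this construction the orthogonal estimator typically yields a lower mean squared error than the i.i.d. one, which is the effect the paper quantifies via the charm of the kernel.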
