A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent

This article characterizes the exact asymptotics of random Fourier feature (RFF) regression in the realistic setting where the number of data samples $n$, their dimension $p$, and the dimension of the feature space $N$ are all large and comparable. In this regime, the RFF Gram matrix no longer converges to the well-known limiting Gaussian kernel matrix (as it does when $N \to \infty$ alone), but it still has a tractable behavior that our analysis captures. This analysis also yields accurate estimates of the training and test regression errors for large $n, p, N$. Based on these estimates, a precise characterization of two qualitatively different phases of learning, and of the phase transition between them, is provided, and the corresponding double descent test error curve is derived from this phase transition behavior. These results do not rely on strong assumptions on the data distribution, and they closely match empirical results on finite-dimensional real-world data sets.
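To make the setting concrete, the following is a minimal sketch (not the paper's code) of RFF ridge regression on synthetic data, assuming the standard cos/sin feature map associated with the Gaussian kernel; the dimensions n, p, N, the ridge parameter lam, and the synthetic data model are illustrative choices, not taken from the article.

```python
# Minimal sketch of random Fourier feature (RFF) ridge regression.
# Assumptions (not from the article): Gaussian random weights W, the
# cos/sin feature map Sigma(x) = [cos(Wx); sin(Wx)], and a toy
# linear-plus-noise data model. Dimensions are purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, p, N, lam = 512, 512, 256, 1024, 1e-2

# Synthetic data: columns of X are samples in R^p, targets are linear plus noise.
X = rng.standard_normal((p, n)) / np.sqrt(p)
X_test = rng.standard_normal((p, n_test)) / np.sqrt(p)
w_star = rng.standard_normal(p)
y = X.T @ w_star + 0.1 * rng.standard_normal(n)
y_test = X_test.T @ w_star + 0.1 * rng.standard_normal(n_test)

# RFF map: W has i.i.d. standard Gaussian entries; Sigma(X) lives in R^{2N x n}.
W = rng.standard_normal((N, p))
def rff(Z):
    return np.vstack([np.cos(W @ Z), np.sin(W @ Z)])

S, S_test = rff(X), rff(X_test)

# Ridge regression in dual form: beta = Sigma (Sigma^T Sigma / n + lam I_n)^{-1} y / n.
K = S.T @ S / n                                  # n x n RFF Gram matrix
beta = S @ np.linalg.solve(K + lam * np.eye(n), y) / n

train_err = np.mean((y - S.T @ beta) ** 2)
test_err = np.mean((y_test - S_test.T @ beta) ** 2)
print(f"train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

Sweeping the ratio 2N/n through 1 (with a small ridge parameter lam) in this sketch should qualitatively reproduce the two learning phases and the double descent shape of the test error discussed above.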
