Double-descent curves in neural networks: a new perspective using Gaussian processes

Double-descent curves in neural networks describe the phenomenon that the generalisation error first decreases as the number of parameters grows, then rises once the parameter count passes an optimum that lies below the number of data points, and finally decreases again in the overparameterised regime. In this paper, we use techniques from random matrix theory to characterise the spectral distribution of the empirical feature covariance matrix as a width-dependent perturbation of the spectrum of the neural network Gaussian process (NNGP) kernel, thereby establishing a novel connection between the NNGP literature and the random matrix theory literature in the context of neural networks. Our analytical expression allows us to study the generalisation behaviour of the corresponding kernel and GP regression, and it provides a new interpretation of the double-descent phenomenon: the curve is governed by the discrepancy between the width-dependent empirical kernel and the width-independent NNGP kernel.
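
To make the central object concrete, the minimal numerical sketch below (not the paper's code; the one-hidden-layer ReLU architecture, standard Gaussian weights, Gaussian inputs, and all function names are illustrative assumptions) compares the spectrum of the finite-width empirical feature kernel with that of the corresponding infinite-width NNGP (arc-cosine) kernel. As the width grows, the empirical kernel concentrates around the NNGP kernel, and the operator-norm gap printed at the end is one simple proxy for the width-dependent discrepancy the abstract refers to.

```python
# Illustrative sketch only: finite-width empirical feature kernel vs. the
# infinite-width NNGP kernel for a single ReLU layer with N(0, 1) weights.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50                                   # data points, input dimension
X = rng.standard_normal((n, d)) / np.sqrt(d)     # i.i.d. Gaussian inputs

def nngp_relu_kernel(X):
    """Infinite-width NNGP kernel of one ReLU layer, weights w ~ N(0, I_d):
    K(x, y) = ||x|| ||y|| (sin(theta) + (pi - theta) cos(theta)) / (2 pi)."""
    norms = np.linalg.norm(X, axis=1)
    cos = np.clip((X @ X.T) / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(norms, norms) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

def empirical_kernel(X, width, rng):
    """Finite-width empirical feature (conjugate) kernel K_m = Phi Phi^T / m."""
    W = rng.standard_normal((X.shape[1], width))   # hidden-layer weights
    Phi = np.maximum(X @ W, 0.0)                   # ReLU features, shape (n, width)
    return Phi @ Phi.T / width

K_inf = nngp_relu_kernel(X)
ev_inf = np.sort(np.linalg.eigvalsh(K_inf))[::-1]

for m in (50, 500, 5000):                          # increasing hidden-layer widths
    K_m = empirical_kernel(X, m, rng)
    ev_m = np.sort(np.linalg.eigvalsh(K_m))[::-1]
    gap = np.linalg.norm(K_m - K_inf, 2)           # operator-norm discrepancy
    print(f"width={m:5d}  top eigenvalue {ev_m[0]:.3f} (NNGP {ev_inf[0]:.3f})  ||K_m - K_inf||_2 = {gap:.3f}")
```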
