Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks

We derive analytical expressions for the generalization performance of kernel regression as a function of the number of training samples, using theoretical methods from Gaussian processes and statistical physics. Our expressions apply to wide neural networks due to an equivalence between training them and kernel regression with the Neural Tangent Kernel (NTK). By decomposing the total generalization error into contributions from the kernel's spectral components, we identify a new spectral principle: as the size of the training set grows, kernel machines and neural networks fit successively higher spectral modes of the target function. When data are sampled from a uniform distribution on a high-dimensional hypersphere, dot-product kernels, including NTK, exhibit learning stages in which different frequency modes of the target function are learned. We verify our theory with simulations on synthetic data and the MNIST dataset.
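
As a rough illustration of the setup the abstract describes, the sketch below runs kernel ridge regression with a dot-product kernel on data drawn uniformly from a hypersphere and tracks test error as the training set grows. This is a minimal sketch, not the paper's code: the arc-cosine-style kernel, the target function, the ridge value, and the sample sizes are illustrative assumptions, with NumPy as the only dependency.

```python
# Minimal sketch (illustrative assumptions throughout): kernel ridge regression
# with a dot-product kernel on data sampled uniformly from the unit hypersphere,
# reporting test error as the training set size n grows.
import numpy as np

rng = np.random.default_rng(0)
d = 10  # input dimension; points live on the unit sphere in R^d

def sample_sphere(n, d):
    """Draw n points uniformly from the unit hypersphere in R^d."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def dot_product_kernel(X1, X2):
    """A dot-product kernel k(x, x') = f(x . x'); here an arc-cosine-style choice."""
    g = np.clip(X1 @ X2.T, -1.0, 1.0)
    return 1.0 - np.arccos(g) / np.pi

def target(X):
    """Illustrative target: a low-frequency plus a higher-frequency component."""
    return X[:, 0] + 0.5 * (3.0 * X[:, 0] ** 2 - 1.0)

X_test = sample_sphere(2000, d)
y_test = target(X_test)

ridge = 1e-6  # small explicit regularizer for numerical stability
for n in [16, 64, 256, 1024]:
    X_tr = sample_sphere(n, d)
    y_tr = target(X_tr)
    K = dot_product_kernel(X_tr, X_tr)
    alpha = np.linalg.solve(K + ridge * np.eye(n), y_tr)  # kernel ridge regression
    y_hat = dot_product_kernel(X_test, X_tr) @ alpha
    print(f"n = {n:5d}   test MSE = {np.mean((y_test - y_hat) ** 2):.4f}")
```

Consistent with the spectral principle stated above, in such an experiment the low-frequency (degree-1) part of the target is expected to be fit at smaller training set sizes than the higher-frequency part.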
