Wide neural networks: From non-Gaussian random fields at initialization to the NTK geometry of training

Recent applications of artificial neural networks with over $n=10^{14}$ parameters make it essential to understand the large-$n$ behavior of such networks. Most work on wide neural networks has focused on the infinite-width limit $n \to +\infty$, in which it has been shown that, at initialization, the networks correspond to Gaussian processes. In this work we study their behavior for large but finite $n$. Our main contributions are the following. (1) We compute the corrections to Gaussianity as an asymptotic series in $n^{-\frac{1}{2}}$, whose coefficients are determined by the statistics of the parameter initialization and by the activation function. (2) We control the evolution of the outputs of finite-width networks during training by bounding their deviations from the infinite-width limit, in which the network evolves through a linear flow. This improves previous estimates and yields sharper decay rates, in terms of $n$, for the finite-width NTK that hold throughout training. As a corollary, we prove that, with arbitrarily high probability, the training of sufficiently wide neural networks converges to a global minimum of the corresponding quadratic loss function. (3) We estimate, in terms of $n$, how the deviations from Gaussianity evolve during training. In particular, using a suitable metric on the space of measures, we show that along training the resulting measure remains within $n^{-\frac{1}{2}}(\log n)^{1+}$ of the time-dependent Gaussian process associated with the infinite-width network, which is given explicitly by precomposing the initial Gaussian process with the linear flow describing training in the infinite-width limit.
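To make the linear flow referred to in (2) and (3) concrete, the following display is a minimal sketch of the standard infinite-width NTK dynamics for gradient flow on the quadratic loss; the notation (training inputs $X$, targets $Y$, limiting NTK $\Theta$) is introduced here for illustration and is not taken verbatim from the paper:

% Sketch under the assumption of gradient flow on the quadratic loss,
% with training inputs $X$, targets $Y$, and limiting NTK $\Theta$ (illustrative notation).
\[
  \partial_t f_t(x) = -\,\Theta(x,X)\bigl(f_t(X) - Y\bigr),
  \qquad
  f_t(x) = f_0(x) - \Theta(x,X)\,\Theta(X,X)^{-1}\bigl(I - e^{-\Theta(X,X)\,t}\bigr)\bigl(f_0(X) - Y\bigr).
\]

Since this map is affine in the values of $f_0$, precomposing the Gaussian process at initialization with it yields a time-dependent Gaussian process, and the bounds described in (3) control how far the law of a finite-width network can drift from it during training.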
