Asymptotics of Wide Networks from Feynman Diagrams

Understanding the asymptotic behavior of wide networks is of considerable interest. In this work, we present a general method for analyzing this large-width behavior. The method is an adaptation of Feynman diagrams, a standard tool for computing multivariate Gaussian integrals. We apply our method to study training dynamics, improving existing bounds and deriving new results on the evolution of wide networks during stochastic gradient descent. Going beyond the strict large-width limit, we present closed-form expressions for the higher-order terms governing wide-network training, and test these predictions empirically.
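
The abstract refers to Feynman diagrams as a bookkeeping device for multivariate Gaussian integrals. The sketch below is not the paper's code; it is a minimal illustration, under my own naming (the helpers pairings and gaussian_moment are hypothetical), of the underlying combinatorial rule, Wick's (Isserlis') theorem: a moment of a zero-mean Gaussian vector equals the sum over all pairings of the indices of products of covariances, with each pairing corresponding to one diagram. A Monte Carlo estimate is included as a sanity check.

```python
# Minimal sketch (not the paper's implementation) of Wick's theorem for a
# zero-mean Gaussian vector x ~ N(0, K):
#   E[x_{i1} ... x_{i2n}] = sum over pairings of prod K[i, j] over the pairs.
# Each pairing is one "diagram"; odd moments vanish.
import itertools
import numpy as np


def pairings(indices):
    """Yield all ways to partition `indices` into unordered pairs."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for sub in pairings(remaining):
            yield [(first, partner)] + sub


def gaussian_moment(K, idx):
    """E[x_{idx[0]} * ... * x_{idx[-1]}] for x ~ N(0, K), via Wick contraction."""
    if len(idx) % 2 == 1:
        return 0.0  # odd moments of a zero-mean Gaussian vanish
    return sum(
        np.prod([K[i, j] for i, j in pairing])
        for pairing in pairings(list(idx))
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 3))
    K = A @ A.T  # a positive-definite covariance

    idx = (0, 1, 1, 2)
    # Three pairings contribute: 2 * K[0,1] * K[1,2] + K[1,1] * K[0,2]
    exact = gaussian_moment(K, idx)
    samples = rng.multivariate_normal(np.zeros(3), K, size=1_000_000)
    mc = np.mean(samples[:, 0] * samples[:, 1] * samples[:, 1] * samples[:, 2])
    print(f"Wick contraction: {exact:.4f}   Monte Carlo: {mc:.4f}")
```

The paper's diagrammatic rules organize exactly these pairings when the Gaussian variables are the network's weights at initialization, which is what makes the large-width counting tractable.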
