The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization

The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.

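To make the setup concrete, below is a minimal NumPy sketch (not the paper's experimental code) of a finite shaped network at initialization: a fully connected network whose leaky-ReLU slopes approach the identity at rate 1/sqrt(width), with the 2x2 random covariance matrix of two inputs recorded at each layer. The constants c_plus and c_minus, the width, and the depth-to-width ratio are illustrative assumptions, not values taken from the paper; averaging many independent draws approximates the distribution that the Neural Covariance SDE describes in the joint depth-and-width limit.

```python
# Minimal sketch of a shaped fully connected network at initialization.
# Assumption: a leaky-ReLU-type shaping with slopes 1 +/- c/sqrt(width);
# the specific constants below are illustrative, not the paper's values.
import numpy as np

rng = np.random.default_rng(0)

n = 400                      # width
depth = 200                  # depth; depth/width plays the role of the SDE time horizon
c_plus, c_minus = 0.0, 1.0   # shaping constants (hypothetical choices)

def shaped_relu(z, width):
    """Leaky-ReLU-type activation whose slopes approach 1 at rate 1/sqrt(width)."""
    s_plus = 1.0 + c_plus / np.sqrt(width)
    s_minus = 1.0 - c_minus / np.sqrt(width)
    return np.where(z > 0, s_plus * z, s_minus * z)

# Two inputs, rescaled so each has squared norm n (unit variance per coordinate).
x = rng.standard_normal((2, n))
x /= np.linalg.norm(x, axis=1, keepdims=True) / np.sqrt(n)

covariances = []
h = x
for _ in range(depth):
    W = rng.standard_normal((n, n)) / np.sqrt(n)   # 1/sqrt(n) initialization scaling
    h = shaped_relu(h @ W.T, n)
    covariances.append(h @ h.T / n)                # 2x2 random covariance matrix

# `covariances` is one sample path of the random covariance matrix across layers;
# repeating over many seeds approximates its distribution at depth/width fixed.
print(covariances[-1])
```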