A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks

We develop a mathematically rigorous framework for multilayer neural networks in the mean field regime. As the network's widths increase, the network's learning trajectory is shown to be well captured by a meaningful and dynamically nonlinear limit (the \textit{mean field} limit), which is characterized by a system of ODEs. Our framework applies to a broad range of network architectures, learning dynamics and network initializations. Central to the framework is the new idea of a \textit{neuronal embedding}, which comprises a non-evolving probability space that allows one to embed neural networks of arbitrary widths. We demonstrate two applications of our framework. Firstly, the framework gives a principled way to study the simplifying effects that independent and identically distributed initializations have on the mean field limit. Secondly, we prove a global convergence guarantee for two-layer and three-layer networks. Unlike previous works that rely on convexity, our result requires a certain universal approximation property, which is a distinctive feature of infinite-width neural networks. To the best of our knowledge, this is the first time global convergence has been established for neural networks of more than two layers in the mean field regime.
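To give a concrete sense of the limiting object in the simplest setting, the following is a minimal sketch of a two-layer mean field ODE system under continuous-time gradient flow on the population loss. The notation (neuron index $c$ in a fixed probability space $(\Omega, P)$ playing the role of a neuronal embedding, loss $\mathcal{L}$, activation $\sigma$) is illustrative and not taken verbatim from the paper, which treats multilayer architectures and more general learning dynamics.

\begin{align*}
  % Limit prediction: an expectation over the non-evolving neuron index C ~ P.
  \hat y(t, x) &= \mathbb{E}_{C \sim P}\!\left[ a(t, C)\, \sigma\!\left( \langle w(t, C), x \rangle \right) \right], \\
  % Per-neuron ODEs: gradient flow on the population loss E[ L(Y, \hat y(t, X)) ],
  % where \partial_2 \mathcal{L} denotes the derivative in the second argument.
  \partial_t w(t, c) &= -\, \mathbb{E}_{(X, Y)}\!\left[ \partial_2 \mathcal{L}\!\left( Y, \hat y(t, X) \right) a(t, c)\, \sigma'\!\left( \langle w(t, c), X \rangle \right) X \right], \\
  \partial_t a(t, c) &= -\, \mathbb{E}_{(X, Y)}\!\left[ \partial_2 \mathcal{L}\!\left( Y, \hat y(t, X) \right) \sigma\!\left( \langle w(t, c), X \rangle \right) \right].
\end{align*}

Here the probability space $(\Omega, P)$ does not evolve; only the functions $w(t, \cdot)$ and $a(t, \cdot)$ defined on it do, which is what allows finite-width networks of arbitrary sizes to be coupled to the same limit.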
