Quantitative Propagation of Chaos for SGD in Wide Neural Networks

In this paper, we investigate the limiting behavior of a continuous-time counterpart of the Stochastic Gradient Descent (SGD) algorithm applied to two-layer overparameterized neural networks, as the number of neurons (i.e., the size of the hidden layer) $N \to +\infty$. Following a probabilistic approach, we show 'propagation of chaos' for the particle system defined by this continuous-time dynamics under different scenarios, indicating that the statistical interaction between the particles asymptotically vanishes. In particular, we establish quantitative convergence with respect to $N$ of any particle to a solution of a mean-field McKean–Vlasov equation, in the metric space endowed with the Wasserstein distance. In comparison with previous works on the subject, we consider settings in which the sequence of SGD stepsizes may depend on both the number of neurons and the iteration count. We then identify two regimes under which different mean-field limits are obtained, one of them corresponding to an implicitly regularized version of the minimization problem at hand. We perform various experiments on real datasets to validate our theoretical results, assessing the existence of these two regimes on classification problems and illustrating our convergence results.
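
As a purely illustrative sketch of the setting, the snippet below runs plain SGD on a mean-field-scaled two-layer network $f_N(x) = \frac{1}{N}\sum_{i=1}^N s_i\,\sigma(\langle w_i, x\rangle)$ with a stepsize schedule `eta(k, N)` that is allowed to depend on both the width $N$ and the iteration $k$; the specific activation, loss, and schedule below are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def sigma(z):
    # Smooth activation (illustrative choice).
    return np.tanh(z)

def two_layer(x, W, s, N):
    # Mean-field scaling: the output averages the N neurons (1/N factor).
    return (s * sigma(W @ x)).sum() / N

def sgd_two_layer(X, y, N, n_steps, eta, rng):
    """Plain SGD on the squared loss for a mean-field two-layer network.

    eta(k, N) -> float is a stepsize schedule that may depend on the
    iteration k and on the width N; different choices of this dependence
    correspond to the different scaling regimes discussed above.
    """
    d = X.shape[1]
    W = rng.standard_normal((N, d))   # input weights, one row per neuron
    s = rng.standard_normal(N)        # output weights
    for k in range(n_steps):
        j = rng.integers(len(X))                 # sample one data point
        x, target = X[j], y[j]
        pre = W @ x                              # pre-activations
        err = two_layer(x, W, s, N) - target     # residual
        # Gradients of 0.5 * err**2; the 1/N network scaling shows up
        # in the gradient of each neuron's parameters.
        grad_s = err * sigma(pre) / N
        grad_W = (err * s * (1.0 - np.tanh(pre) ** 2) / N)[:, None] * x[None, :]
        step = eta(k, N)
        s -= step * grad_s
        W -= step * grad_W
    return W, s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5))
    y = np.sin(X[:, 0])
    # One possible regime: stepsize proportional to N, decaying in k.
    W, s = sgd_two_layer(X, y, N=1000, n_steps=5000,
                         eta=lambda k, N: N / (100.0 + k), rng=rng)
```

In this toy sketch, each neuron $(w_i, s_i)$ plays the role of a particle; the mean-field limit describes the evolution of the empirical distribution of these particles as $N \to +\infty$.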
