A mean field view of the landscape of two-layer neural networks

Significance

Multilayer neural networks have proven extremely successful in a variety of tasks, from image classification to robotics. However, the reasons for this practical success and its precise domain of applicability are unknown. Learning a neural network from data requires solving a complex optimization problem with millions of variables. This is done by stochastic gradient descent (SGD) algorithms. We study the case of two-layer networks and derive a compact description of the SGD dynamics in terms of a limiting partial differential equation. Among other consequences, this shows that SGD dynamics does not become more complex when the network size increases.

Abstract

Multilayer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires optimizing a nonconvex high-dimensional objective (risk function), a problem that is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the former case, does this happen because local minima are absent or because SGD somehow avoids them? In the latter, why do local minima reached by SGD have good generalization properties? In this paper, we consider a simple case, namely two-layer neural networks, and prove that, in a suitable scaling limit, SGD dynamics is captured by a certain nonlinear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows for "averaging out" some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.
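To make the setting concrete, the following is a brief sketch, in notation assumed here rather than quoted from the paper, of the objects the abstract refers to: a two-layer network with N hidden units, its population risk under square loss, and the limiting distributional dynamics that are claimed to capture SGD in the large-N limit.

% Two-layer network and population risk (assumed notation)
\hat{y}(x;\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\sigma_*(x;\theta_i),
\qquad
R_N(\boldsymbol{\theta}) = \mathbb{E}\big[(y-\hat{y}(x;\boldsymbol{\theta}))^2\big].

% Distributional dynamics: if \hat{\rho}^{(N)}_t = N^{-1}\sum_i \delta_{\theta_i(t)} denotes the
% empirical distribution of the parameters after roughly t/\varepsilon SGD steps with step size
% \varepsilon \propto 1/N, then \hat{\rho}^{(N)}_t is expected to converge to \rho_t solving
\partial_t \rho_t = 2\,\xi(t)\,\nabla_\theta\!\cdot\!\big(\rho_t\,\nabla_\theta \Psi(\theta;\rho_t)\big),
\qquad
\Psi(\theta;\rho) = V(\theta) + \int U(\theta,\tilde{\theta})\,\rho(\mathrm{d}\tilde{\theta}),

% where V and U are the first- and second-order correlation functions of the neuron
% activation with the data, and \xi(t) is a step-size schedule:
V(\theta) = -\mathbb{E}\big[y\,\sigma_*(x;\theta)\big],
\qquad
U(\theta_1,\theta_2) = \mathbb{E}\big[\sigma_*(x;\theta_1)\,\sigma_*(x;\theta_2)\big].

For noisy SGD, a diffusion (Laplacian) term is added to the PDE; the convergence results mentioned in the abstract are stated for that regularized dynamics.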
