Trainability and Accuracy of Artificial Neural Networks: An Interacting Particle System Approach

Neural networks, a central tool in machine learning, have demonstrated remarkable, high-fidelity performance on image recognition and classification tasks. These successes evince an ability to accurately represent high-dimensional functions, but rigorous results about the approximation error of neural networks after training are few. Here we establish conditions for global convergence of the standard optimization algorithm used in machine learning applications, stochastic gradient descent (SGD), and quantify the scaling of its error with the size of the network. This is done by reinterpreting SGD as the evolution of a particle system with interactions governed by a potential related to the objective or "loss" function used to train the network. We show that, when the number $n$ of units is large, the empirical distribution of the particles descends on a convex landscape towards the global minimum at a rate independent of $n$, with a resulting approximation error that universally scales as $O(n^{-1})$. These properties are established in the form of a Law of Large Numbers and a Central Limit Theorem for the empirical distribution. Our analysis also quantifies the scale and nature of the noise introduced by SGD and provides guidelines for the step size and batch size to use when training a neural network. We illustrate our findings on examples in which we train neural networks to learn the energy function of the continuous 3-spin model on the sphere. The approximation error scales as our analysis predicts in dimensions as high as $d=25$.
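
To make the setup concrete, below is a minimal numpy sketch (not the authors' code) of the kind of experiment the abstract describes: a two-layer network in the mean-field scaling $f(x) = n^{-1}\sum_i c_i\,\sigma(a_i\cdot x)$ trained by plain SGD to fit the energy of a random spherical 3-spin model. The dimension, coupling-tensor normalization, tanh nonlinearity, learning rate, and batch size are all illustrative assumptions rather than the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Target: energy of a random continuous 3-spin model on the sphere ---
# Assumption: H(x) = sum_{ijk} J_ijk x_i x_j x_k with iid Gaussian couplings
# scaled by 1/d and configurations x on the sphere of radius sqrt(d).
d = 10
J = rng.standard_normal((d, d, d)) / d  # coupling tensor (illustrative scaling)

def spin_energy(x):
    """3-spin energy for a batch of configurations x of shape (batch, d)."""
    return np.einsum('bi,bj,bk,ijk->b', x, x, x, J)

def sample_sphere(batch):
    """Uniform samples on the sphere of radius sqrt(d)."""
    x = rng.standard_normal((batch, d))
    return np.sqrt(d) * x / np.linalg.norm(x, axis=1, keepdims=True)

# --- Two-layer network in the mean-field scaling: f(x) = (1/n) sum_i c_i tanh(a_i . x) ---
n = 512                                         # number of hidden units ("particles")
a = rng.standard_normal((n, d)) / np.sqrt(d)    # input weights
c = rng.standard_normal(n)                      # output weights

# --- Plain SGD on the quadratic loss, one fresh mini-batch per step ---
lr, batch, steps = 0.1, 64, 10000
for _ in range(steps):
    xb = sample_sphere(batch)
    yb = spin_energy(xb)
    pre = np.tanh(xb @ a.T)                     # (batch, n) hidden activations
    err = pre @ c / n - yb                      # (batch,) residuals
    # gradients of 0.5 * mean(err^2); the 1/n in the model makes each unit's
    # gradient O(1/n), which is the interacting-particle (mean-field) viewpoint
    grad_c = pre.T @ err / batch / n
    grad_a = ((err[:, None] * (1 - pre**2) * c[None, :]).T @ xb) / batch / n
    c -= lr * n * grad_c                        # rescale so particles move at O(1) speed
    a -= lr * n * grad_a

x_test = sample_sphere(2000)
mse = np.mean((np.tanh(x_test @ a.T) @ c / n - spin_energy(x_test))**2)
print(f"test MSE with n={n}: {mse:.4f}")
```

Repeating the run for several values of $n$ (say 128, 256, 512, 1024) and plotting the test error against $1/n$ is the kind of check the abstract's $O(n^{-1})$ scaling suggests.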
