Analysis of a Two-Layer Neural Network via Displacement Convexity

Fitting a function using linear combinations of a large number $N$ of `simple' components is one of the most fruitful ideas in statistical learning. It lies at the core of a variety of methods, from two-layer neural networks to kernel regression and boosting. In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or its variants. Unfortunately, little is known about the global convergence properties of these approaches. Here we consider the problem of learning a concave function $f$ on a compact convex domain $\Omega\subseteq {\mathbb R}^d$, using linear combinations of `bump-like' components (neurons). The parameters to be fitted are the centers of the $N$ bumps, and the resulting empirical risk minimization problem is highly non-convex. We prove that, in the limit in which the number of neurons diverges, the evolution of gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega$. Further, when the bump width $\delta$ tends to $0$, this gradient flow has a limit which is a viscous porous medium equation. Remarkably, the cost function optimized by this gradient flow exhibits a special property known as displacement convexity, which implies exponential convergence rates as $N\to\infty$, $\delta\to 0$. Surprisingly, this asymptotic theory appears to capture the behavior well even for moderate values of $\delta$ and $N$. Explaining this phenomenon, and understanding the dependence on $\delta$ and $N$ in a quantitative manner, remains an outstanding challenge.
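To make the setup concrete, the following is a minimal sketch, not the authors' implementation, of the non-convex empirical risk minimization described above: a concave target on $\Omega = [-1,1]^2$ is fit by the average of $N$ Gaussian bumps of fixed width $\delta$, with plain gradient descent acting only on the bump centers. The target function, the Gaussian bump shape, and all constants ($N$, $\delta$, learning rate, sample size) are illustrative choices, not values from the paper.

```python
# Minimal sketch: fit a concave target on Omega = [-1, 1]^d with N Gaussian
# "bump" neurons of fixed width delta, training only the bump centers by
# gradient descent on the empirical squared risk. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, N, delta, n_samples, lr, n_steps = 2, 200, 0.2, 1000, 0.5, 2000

# Concave target f(x) = 1 - ||x||^2 on the hypercube Omega = [-1, 1]^d (assumed example).
X = rng.uniform(-1.0, 1.0, size=(n_samples, d))
y = 1.0 - np.sum(X**2, axis=1)

# Trainable parameters: the bump centers w_1, ..., w_N in Omega.
W = rng.uniform(-1.0, 1.0, size=(N, d))

def bumps(X, W, delta):
    """Gaussian bumps K_delta(x - w_i); one column per neuron."""
    sq = np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=-1)   # (n_samples, N)
    return np.exp(-sq / (2.0 * delta**2))

for _ in range(n_steps):
    Phi = bumps(X, W, delta)                  # (n_samples, N)
    y_hat = Phi.mean(axis=1)                  # prediction: (1/N) sum_i K_delta(x - w_i)
    resid = y_hat - y                         # residuals of the empirical risk
    # Gradient of the risk (1/2n) sum_s resid_s^2 with respect to each center w_i.
    coef = resid[:, None] * Phi / (N * delta**2)                        # (n_samples, N)
    grad = np.einsum('sn,snd->nd', coef,
                     X[:, None, :] - W[None, :, :]) / n_samples         # (N, d)
    W -= lr * grad
    W = np.clip(W, -1.0, 1.0)                 # keep the centers inside the compact domain

final_risk = 0.5 * np.mean((bumps(X, W, delta).mean(axis=1) - y) ** 2)
print("final empirical risk:", final_risk)
```

The risk is non-convex in the centers $W$, yet in the many-neuron, small-width limit the abstract describes, the corresponding Wasserstein gradient flow minimizes a displacement-convex functional, which is what yields the exponential convergence rates.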
