Analysis of a Two-Layer Neural Network via Displacement Convexity

Fitting a function using linear combinations of a large number $N$ of `simple' components is one of the most fruitful ideas in statistical learning. It lies at the core of a variety of methods, from two-layer neural networks to kernel regression and boosting. In general, the resulting risk minimization problem is non-convex and is solved by gradient descent or its variants. Unfortunately, little is known about the global convergence properties of these approaches. Here we consider the problem of learning a concave function $f$ on a compact convex domain $\Omega\subseteq {\mathbb R}^d$, using linear combinations of `bump-like' components (neurons). The parameters to be fitted are the centers of the $N$ bumps, and the resulting empirical risk minimization problem is highly non-convex. We prove that, in the limit in which the number of neurons diverges, the evolution of gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega$. Further, when the bump width $\delta$ tends to $0$, this gradient flow has a limit which is a viscous porous medium equation. Remarkably, the cost function optimized by this gradient flow exhibits a special property known as displacement convexity, which implies exponential convergence rates as $N\to\infty$, $\delta\to 0$. Surprisingly, this asymptotic theory appears to capture the behavior well even for moderate values of $\delta$ and $N$. Explaining this phenomenon, and understanding the dependence on $\delta$ and $N$ in a quantitative manner, remains an outstanding challenge.
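To make the setup concrete, the following is a minimal sketch, not the authors' implementation, of the non-convex empirical risk minimization described above: a concave target on $\Omega = [-1,1]^2$ is fit by the average of $N$ Gaussian bumps of fixed width $\delta$, with plain gradient descent acting only on the bump centers. The target function, the Gaussian bump shape, and all constants ($N$, $\delta$, learning rate, sample size) are illustrative choices, not values from the paper.

```python
# Minimal sketch: fit a concave target on Omega = [-1, 1]^d with N Gaussian
# "bump" neurons of fixed width delta, training only the bump centers by
# gradient descent on the empirical squared risk. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, N, delta, n_samples, lr, n_steps = 2, 200, 0.2, 1000, 0.5, 2000

# Concave target f(x) = 1 - ||x||^2 on the hypercube Omega = [-1, 1]^d (assumed example).
X = rng.uniform(-1.0, 1.0, size=(n_samples, d))
y = 1.0 - np.sum(X**2, axis=1)

# Trainable parameters: the bump centers w_1, ..., w_N in Omega.
W = rng.uniform(-1.0, 1.0, size=(N, d))

def bumps(X, W, delta):
    """Gaussian bumps K_delta(x - w_i); one column per neuron."""
    sq = np.sum((X[:, None, :] - W[None, :, :]) ** 2, axis=-1)   # (n_samples, N)
    return np.exp(-sq / (2.0 * delta**2))

for _ in range(n_steps):
    Phi = bumps(X, W, delta)                  # (n_samples, N)
    y_hat = Phi.mean(axis=1)                  # prediction: (1/N) sum_i K_delta(x - w_i)
    resid = y_hat - y                         # residuals of the empirical risk
    # Gradient of the risk (1/2n) sum_s resid_s^2 with respect to each center w_i.
    coef = resid[:, None] * Phi / (N * delta**2)                        # (n_samples, N)
    grad = np.einsum('sn,snd->nd', coef,
                     X[:, None, :] - W[None, :, :]) / n_samples         # (N, d)
    W -= lr * grad
    W = np.clip(W, -1.0, 1.0)                 # keep the centers inside the compact domain

final_risk = 0.5 * np.mean((bumps(X, W, delta).mean(axis=1) - y) ** 2)
print("final empirical risk:", final_risk)
```

The risk is non-convex in the centers $W$, yet in the many-neuron, small-width limit the abstract describes, the corresponding Wasserstein gradient flow minimizes a displacement-convex functional, which is what yields the exponential convergence rates.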
