Can Shallow Neural Networks Beat the Curse of Dimensionality? A Mean Field Training Perspective

We prove that, under mean field scaling, gradient descent training of a two-layer neural network on the empirical or population risk may not decrease the population risk at a rate faster than $t^{-4/(d-2)}$. The loss functional is the mean squared error with a Lipschitz-continuous target function and data distributed uniformly on the $d$-dimensional unit cube. Gradient descent training for fitting reasonably smooth, but truly high-dimensional, data may thus be subject to the curse of dimensionality. We present numerical evidence that gradient descent training with a general Lipschitz target function becomes progressively slower as the dimension increases, but converges at approximately the same rate in all dimensions when the target function lies in the natural function space for two-layer ReLU networks.

Impact Statement: Artificial neural networks perform well in many real-life applications, but may suffer from the curse of dimensionality on certain problems. We provide theoretical and numerical evidence that this may be related to whether the target function lies in the hypothesis class described by infinitely wide networks. The training dynamics are considered in the fully non-linear regime and are not reduced to neural tangent kernels. We believe that it will be essential to study these hypothesis classes in detail in order to choose an appropriate machine learning model for a given problem. The goal of this article is to illustrate this point in a mathematically sound and numerically convincing fashion.
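The setting of the numerical experiments can be made concrete with a short sketch. The following is a minimal illustration, not the authors' code, of full-batch gradient descent for a two-layer ReLU network under mean field scaling, trained on the mean squared error against a Lipschitz target with data sampled uniformly from the unit cube; the width, sample size, step size, and the particular target $f^*(x) = |x|$ are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d, m, n = 20, 2000, 10_000            # input dimension, network width, sample size

# training and held-out data, uniform on the d-dimensional unit cube, with a
# hypothetical Lipschitz target f*(x) = |x| (illustrative choice)
X, Xtest = torch.rand(n, d), torch.rand(n, d)
y, ytest = X.norm(dim=1, keepdim=True), Xtest.norm(dim=1, keepdim=True)

# two-layer ReLU network under mean field scaling:
#   f(x) = (1/m) * sum_i a_i * relu(w_i . x + b_i)
W = torch.randn(m, d, requires_grad=True)
b = torch.zeros(m, requires_grad=True)
a = torch.randn(m, 1, requires_grad=True)

def model(x):
    return torch.relu(x @ W.T + b) @ a / m

# the 1/m normalization makes individual parameter gradients O(1/m), so the
# step size is scaled with m to obtain non-trivial mean field dynamics
opt = torch.optim.SGD([W, b, a], lr=1e-2 * m)

for t in range(5001):
    opt.zero_grad()
    loss = ((model(X) - y) ** 2).mean()                  # empirical L^2 risk
    loss.backward()
    opt.step()
    if t % 500 == 0:
        with torch.no_grad():
            test = ((model(Xtest) - ytest) ** 2).mean()  # proxy for population risk
        print(f"step {t:5d}  train {loss.item():.4f}  test {test.item():.4f}")
```

Repeating such a run across dimensions $d$, and with targets inside versus outside the natural function space for two-layer ReLU networks, is the kind of comparison the numerical evidence in the abstract refers to.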
