Scaling Limit of Neural Networks with the Xavier Initialization and Convergence to a Global Minimum

We analyze single-layer neural networks with the Xavier initialization in the asymptotic regime where both the number of hidden units and the number of stochastic gradient descent training steps grow large. The evolution of the neural network during training can be viewed as a stochastic system and, using techniques from stochastic analysis, we prove that the neural network converges in distribution to a random ODE with a Gaussian distribution. The limit is completely different from the typical mean-field limits for neural networks, due to the $\frac{1}{\sqrt{N}}$ normalization factor in the Xavier initialization (versus the $\frac{1}{N}$ factor in the typical mean-field framework). Although the pre-limit problem of optimizing a neural network is non-convex (so the neural network may converge to a local minimum), the limit equation minimizes a quadratic, convex objective function and therefore converges to a global minimum. Furthermore, under reasonable assumptions, the matrix in the limiting quadratic objective function is positive definite, and thus the neural network, in the limit, converges to a global minimum with zero loss on the training set.
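To make the scaling concrete, the following is a minimal numerical sketch (not the paper's code) of the setup described above: a single-hidden-layer network whose output is normalized by $\frac{1}{\sqrt{N}}$, trained by stochastic gradient descent on a squared loss. The width $N$, the tanh nonlinearity, the learning rate, and the toy training set are illustrative assumptions; only the $\frac{1}{\sqrt{N}}$ normalization (as opposed to the $\frac{1}{N}$ mean-field scaling) is taken from the abstract.

```python
import numpy as np

# Illustrative sketch of a Xavier-scaled single-hidden-layer network trained by SGD.
# Network: g^N(x) = (1/sqrt(N)) * sum_i c_i * sigma(w_i . x)
# (the mean-field scaling would instead use a 1/N prefactor).

rng = np.random.default_rng(0)
N, d = 1000, 3                       # hidden units, input dimension (assumed values)
W = rng.normal(size=(N, d))          # inner weights, mean-zero unit-variance init
c = rng.normal(size=N)               # outer weights, mean-zero unit-variance init
sigma = np.tanh                      # smooth bounded nonlinearity (assumed)

def forward(x):
    # Xavier-scaled network output g^N(x)
    return c @ sigma(W @ x) / np.sqrt(N)

# Toy training set (purely illustrative)
X = rng.normal(size=(50, d))
y = np.sin(X[:, 0])

lr = 0.1
for step in range(20_000):
    k = rng.integers(len(X))          # sample one training point
    x_k, t_k = X[k], y[k]
    h = sigma(W @ x_k)
    err = c @ h / np.sqrt(N) - t_k    # residual g^N(x_k) - y_k
    # SGD step on the squared loss 0.5 * err^2; tanh'(z) = 1 - tanh(z)^2
    grad_c = err * h / np.sqrt(N)
    grad_W = err * np.outer(c * (1.0 - h**2), x_k) / np.sqrt(N)
    c -= lr * grad_c
    W -= lr * grad_W

mse = np.mean([(forward(x) - t) ** 2 for x, t in zip(X, y)])
print("training MSE after SGD:", mse)
```

For large $N$, the individual parameters in this sketch move only slightly during training while the network output changes by an order-one amount, which is the regime in which the random-ODE limit described above arises.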
