Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology

Recent works have shown that gradient descent can find a global minimum for over-parameterized neural networks in which the widths of all hidden layers scale polynomially with $N$ (the number of training samples). In this paper, we prove that, for deep networks, a single layer of width $N$ following the input layer suffices to ensure a similar guarantee. In particular, all the remaining layers are allowed to have constant widths and to form a pyramidal topology. We show an application of our result to the widely used Xavier initialization and obtain an over-parameterization requirement of order $N^2$ for the single wide layer.
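
The sketch below is a minimal illustration (not the authors' code) of the architecture the abstract describes: one hidden layer of width $N$ right after the input, followed by hidden layers of constant, non-increasing ("pyramidal") widths, all Xavier-initialized. The depth, the constant width, the Tanh activation, and the helper name `pyramidal_net` are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a "one wide layer + pyramidal topology" network, assuming
# a smooth activation (Tanh) and Xavier (Glorot) initialization.
import torch.nn as nn


def pyramidal_net(d_in: int, N: int, pyramid_widths, d_out: int) -> nn.Sequential:
    """Input -> one wide layer of width N -> pyramidal (non-increasing) layers -> output."""
    widths = [d_in, N, *pyramid_widths, d_out]
    # Pyramidal topology: hidden widths after the wide layer must be non-increasing.
    assert all(a >= b for a, b in zip(widths[2:-1], widths[3:-1]))
    layers = []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        linear = nn.Linear(fan_in, fan_out)
        nn.init.xavier_normal_(linear.weight)  # Xavier (Glorot) initialization
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.Tanh()]          # smooth activation (an assumption)
    return nn.Sequential(*layers[:-1])         # no activation on the output layer


# Example: N = 1000 training samples -> first hidden layer of width N = 1000,
# remaining hidden layers of constant width 50 (constants chosen arbitrarily).
net = pyramidal_net(d_in=100, N=1000, pyramid_widths=[50, 50, 50], d_out=1)
```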
