Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks

Many modern neural network architectures are trained in an overparameterized regime where the number of model parameters exceeds the size of the training dataset. Sufficiently overparameterized architectures in principle have the capacity to fit any set of labels, including random noise. However, given the highly nonconvex nature of the training landscape, it is not clear what level and kind of overparameterization is required for first-order methods to converge to a global optimum that perfectly interpolates the labels. A number of recent theoretical works have shown that for very wide neural networks, where the number of hidden units is polynomially large in the size of the training data, gradient descent starting from a random initialization does indeed converge to a global optimum. In practice, however, much more moderate levels of overparameterization seem to suffice: overparameterized models often interpolate the training data perfectly as soon as the number of parameters exceeds the size of the training data by a constant factor. There is thus a substantial gap between the existing theoretical literature and practical experiments. In this paper we take a step toward closing this gap. Focusing on shallow neural networks with smooth activations, we show that (stochastic) gradient descent initialized at random converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data. Our results also come with a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs).
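To make the overparameterization threshold concrete: for a one-hidden-layer network with k hidden units and d-dimensional inputs, the trainable hidden layer has k·d parameters, so the requirement that the square root of the number of parameters exceed the number of training points n reads √(k·d) ≥ n, i.e. k·d ≥ n². The sketch below is our own toy illustration of this regime, not the authors' code or experiments; the activation, width, step size, and iteration count are assumptions chosen for readability.

```python
# Minimal numerical sketch (illustrative only): a one-hidden-layer network
# f(x) = v^T tanh(W x) with k hidden units, fixed random output weights v,
# and hidden weights W trained by plain gradient descent from random init.
# The hidden layer has k*d parameters; the width is chosen so that
# sqrt(k*d) >= n, i.e. k*d >= n^2.  All constants are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n, d = 30, 20                     # n training points in d dimensions
k = int(np.ceil(n**2 / d))        # hidden width chosen so that k*d >= n^2

X = rng.standard_normal((n, d)) / np.sqrt(d)      # training inputs (roughly unit norm)
y = rng.standard_normal(n)                        # arbitrary (even random) labels

W = rng.standard_normal((k, d))                   # random initialization of hidden layer
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)  # fixed output layer

phi = np.tanh                                     # smooth activation
dphi = lambda z: 1.0 - np.tanh(z) ** 2            # its derivative
eta = 0.25                                        # illustrative step size

for it in range(5001):
    Z = X @ W.T                   # (n, k) pre-activations
    r = phi(Z) @ v - y            # residuals f(x_i) - y_i
    if it % 1000 == 0:
        print(f"iter {it:5d}   loss {0.5 * np.sum(r**2):.6f}")
    # gradient of 0.5 * sum_i (f(x_i) - y_i)^2 with respect to W
    G = (r[:, None] * dphi(Z) * v[None, :]).T @ X
    W -= eta * G
```

Under the paper's conditions the printed loss should shrink roughly geometrically toward zero, i.e. the random initialization lies close to a global optimum that interpolates the labels; the precise rate depends on the spectrum of the data-dependent Jacobian, which the paper's analysis characterizes.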
