How Much Over-parameterization Is Sufficient to Learn Deep ReLU Networks?

A recent line of research on deep learning focuses on the extremely over-parameterized setting, and shows that when the network width is larger than a high-degree polynomial of the training sample size $n$ and the inverse of the target accuracy $\epsilon^{-1}$, deep neural networks learned by (stochastic) gradient descent enjoy favorable optimization and generalization guarantees. Very recently, it has been shown that, under a certain margin assumption on the training data, a polylogarithmic width condition suffices for two-layer ReLU networks to converge and generalize (Ji and Telgarsky, 2019). However, how much over-parameterization is sufficient to guarantee optimization and generalization for deep neural networks remains an open question. In this work, we establish sharp optimization and generalization guarantees for deep ReLU networks. Under various assumptions made in previous work, our optimization and generalization guarantees hold with network width polylogarithmic in $n$ and $\epsilon^{-1}$. Our results push the study of over-parameterized deep neural networks towards more practical settings.
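
To make the scaling concrete, the display below gives a schematic comparison of the two width conditions discussed above. This is a sketch only: $m$ denotes the network width, the dependence on other problem parameters (e.g., the network depth and the data margin) is suppressed, and the precise exponents and assumptions are those stated in the respective papers' theorems.

$$\underbrace{m \;\ge\; \mathrm{poly}\!\left(n,\ \epsilon^{-1}\right)}_{\text{earlier over-parameterization analyses}} \qquad\text{versus}\qquad \underbrace{m \;\ge\; \mathrm{polylog}\!\left(n,\ \epsilon^{-1}\right)}_{\text{this work, under margin-type assumptions}}$$

That is, the required width grows only polylogarithmically, rather than polynomially, in the sample size and the inverse target accuracy.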

[1] Peter L. Bartlett, et al. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results, 2003, J. Mach. Learn. Res.

[2] Claudio Gentile, et al. On the generalization ability of on-line learning algorithms, 2001, IEEE Transactions on Information Theory.

[3] N. U. Prabhu, et al. Stochastic Processes and Their Applications, 1999.

[4] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[5] Roman Vershynin, et al. Introduction to the non-asymptotic analysis of random matrices, 2010, Compressed Sensing.

[6] Ameet Talwalkar, et al. Foundations of Machine Learning, 2012, Adaptive Computation and Machine Learning.

[7] Shai Ben-David, et al. Understanding Machine Learning: From Theory to Algorithms, 2014.

[8] Ryota Tomioka, et al. Norm-Based Capacity Control in Neural Networks, 2015, COLT.

[9] René Vidal, et al. Global Optimality in Tensor Factorization, Deep Learning, and Beyond, 2015, ArXiv.

[10] Jian Sun, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015, IEEE International Conference on Computer Vision (ICCV).

[11] Kenji Kawaguchi, et al. Deep Learning without Poor Local Minima, 2016, NIPS.

[12] Yi Zhou, et al. Critical Points of Neural Networks: Analytical Forms and Landscape Properties, 2017, ArXiv.

[13] Joan Bruna, et al. Topology and Geometry of Half-Rectified Network Optimization, 2016, ICLR.

[14] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.

[15] Matus Telgarsky, et al. Spectrally-normalized margin bounds for neural networks, 2017, NIPS.

[16] Le Song, et al. Diverse Neural Network Learns True Target Functions, 2016, AISTATS.

[17] Matthias Hein, et al. The Loss Surface of Deep and Wide Neural Networks, 2017, ICML.

[18] Joan Bruna, et al. Neural Networks with Finite Intrinsic Dimension have no Spurious Valleys, 2018, ArXiv.

[19] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[20] Ohad Shamir, et al. Size-Independent Sample Complexity of Neural Networks, 2017, COLT.

[21] Nathan Srebro, et al. Characterizing Implicit Bias in Terms of Optimization Geometry, 2018, ICML.

[22] Suvrit Sra, et al. Global optimality conditions for deep neural networks, 2017, ICLR.

[23] Matus Telgarsky, et al. Risk and parameter convergence of logistic regression, 2018, ArXiv.

[24] Nathan Srebro, et al. Implicit Regularization in Matrix Factorization, 2017, 2018 Information Theory and Applications Workshop (ITA).

[25] David A. McAllester, et al. A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks, 2017, ICLR.

[26] Grant M. Rotskoff, et al. Neural Networks as Interacting Particle Systems: Asymptotic Convexity of the Loss Landscape and Universal Scaling of the Approximation Error, 2018, ArXiv.

[27] Nathan Srebro, et al. Implicit Bias of Gradient Descent on Linear Convolutional Networks, 2018, NeurIPS.

[28] Justin A. Sirignano, et al. Mean field analysis of neural networks: A central limit theorem, 2018, Stochastic Processes and their Applications.

[29] Hongyang Zhang, et al. Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations, 2017, COLT.

[30] Junwei Lu, et al. On Tighter Generalization Bound for Deep Neural Networks: CNNs, ResNets, and Beyond, 2018, ArXiv.

[31] Ohad Shamir, et al. Spurious Local Minima are Common in Two-Layer ReLU Neural Networks, 2017, ICML.

[32] Francis Bach, et al. On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport, 2018, NeurIPS.

[33] Yi Zhang, et al. Stronger generalization bounds for deep nets via a compression approach, 2018, ICML.

[34] Jason D. Lee, et al. On the Power of Over-parametrization in Neural Networks with Quadratic Activation, 2018, ICML.

[35] Andrea Montanari, et al. A mean field view of the landscape of two-layer neural networks, 2018, Proceedings of the National Academy of Sciences.

[36] Arthur Jacot, et al. Neural tangent kernel: convergence and generalization in neural networks, 2018, NeurIPS.

[37] Liwei Wang, et al. Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[38] Taiji Suzuki, et al. Refined Generalization Analysis of Gradient Descent for Over-parameterized Two-layer Neural Networks with Smooth Activations on Classification Problems, 2019, ArXiv.

[39] Quanquan Gu, et al. An Improved Analysis of Training Over-parameterized Deep Neural Networks, 2019, NeurIPS.

[40] Ruosong Wang, et al. On Exact Computation with an Infinitely Wide Neural Net, 2019, NeurIPS.

[41] Quynh Nguyen, et al. On Connected Sublevel Sets in Deep Learning, 2019, ICML.

[42] Ruosong Wang, et al. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks, 2019, ICML.

[43] Hanze Dong, et al. Over Parameterized Two-level Neural Networks Can Learn Near Optimal Feature Representations, 2019, ArXiv.

[44] Yuan Cao, et al. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks, 2018, ArXiv.

[45] Yuanzhi Li, et al. A Convergence Theory for Deep Learning via Over-Parameterization, 2018, ICML.

[46] Yuan Cao, et al. Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks, 2019, NeurIPS.

[47] Yuanzhi Li, et al. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers, 2018, NeurIPS.

[48] Francis Bach, et al. On Lazy Training in Differentiable Programming, 2018, NeurIPS.

[49] Nathan Srebro, et al. Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models, 2019, ICML.

[50] Matus Telgarsky, et al. The implicit bias of gradient descent on nonseparable data, 2019, COLT.

[51] Colin Wei, et al. Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel, 2018, NeurIPS.

[52] Tong Zhang, et al. Convex Formulation of Overparameterized Deep Neural Networks, 2019, IEEE Transactions on Information Theory.

[53] Jaehoon Lee, et al. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, NeurIPS.

[54] Samet Oymak, et al. Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?, 2018, ICML.

[55] Jiaoyang Huang, et al. Gradient Descent Finds Global Minima for Generalizable Deep Neural Networks of Practical Sizes, 2019, 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[56] Yuanzhi Li, et al. On the Convergence Rate of Training Recurrent Neural Networks, 2018, NeurIPS.

[57] Barnabás Póczos, et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks, 2018, ICLR.

[58] Yuanzhi Li, et al. What Can ResNet Learn Efficiently, Going Beyond Kernels?, 2019, NeurIPS.

[59] Nathan Srebro, et al. Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate, 2018, AISTATS.

[60] Yuan Cao, et al. Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks, 2019, NeurIPS.

[61] Gilad Yehudai, et al. On the Power and Limitations of Random Features for Understanding Neural Networks, 2019, NeurIPS.

[62] Mikhail Belkin, et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off, 2018, Proceedings of the National Academy of Sciences.

[63] Andrea Montanari, et al. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit, 2019, COLT.

[64] Adel Javanmard, et al. Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks, 2017, IEEE Transactions on Information Theory.

[65] Jason D. Lee, et al. Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks, 2019, ICLR.

[66] Matus Telgarsky, et al. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks, 2019, ICLR.

[67] Quanquan Gu, et al. Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks, 2019, AAAI.

[68] Kaifeng Lyu, et al. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks, 2019, ICLR.

[69] Samet Oymak, et al. Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks, 2019, IEEE Journal on Selected Areas in Information Theory.

[70] Mikhail Belkin, et al. Two models of double descent for weak features, 2019, SIAM J. Math. Data Sci.

[71] Andrea Montanari, et al. The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve, 2019, Communications on Pure and Applied Mathematics.

[72] Ohad Shamir, et al. Gradient Methods Never Overfit On Separable Data, 2020, J. Mach. Learn. Res.

[73] Andrea Montanari, et al. Surprises in High-Dimensional Ridgeless Least Squares Interpolation, 2019, Annals of Statistics.