Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology

Recent works have shown that gradient descent can find a global minimum for over-parameterized neural networks in which the widths of all hidden layers scale polynomially with $N$ (the number of training samples). In this paper, we prove that, for deep networks, a single layer of width $N$ following the input layer suffices to ensure a similar guarantee. In particular, all the remaining layers are allowed to have constant widths and to form a pyramidal topology. We show an application of our result to the widely used Xavier initialization and obtain an over-parameterization requirement of order $N^2$ for the single wide layer.
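
The sketch below is a minimal illustration (not the authors' code) of the architecture the abstract describes: one hidden layer of width $N$ right after the input, followed by hidden layers of constant, non-increasing ("pyramidal") widths, all Xavier-initialized. The depth, the constant width, the Tanh activation, and the helper name `pyramidal_net` are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of a "one wide layer + pyramidal topology" network, assuming
# a smooth activation (Tanh) and Xavier (Glorot) initialization.
import torch.nn as nn


def pyramidal_net(d_in: int, N: int, pyramid_widths, d_out: int) -> nn.Sequential:
    """Input -> one wide layer of width N -> pyramidal (non-increasing) layers -> output."""
    widths = [d_in, N, *pyramid_widths, d_out]
    # Pyramidal topology: hidden widths after the wide layer must be non-increasing.
    assert all(a >= b for a, b in zip(widths[2:-1], widths[3:-1]))
    layers = []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        linear = nn.Linear(fan_in, fan_out)
        nn.init.xavier_normal_(linear.weight)  # Xavier (Glorot) initialization
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.Tanh()]          # smooth activation (an assumption)
    return nn.Sequential(*layers[:-1])         # no activation on the output layer


# Example: N = 1000 training samples -> first hidden layer of width N = 1000,
# remaining hidden layers of constant width 50 (constants chosen arbitrarily).
net = pyramidal_net(d_in=100, N=1000, pyramid_widths=[50, 50, 50], d_out=1)
```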
