Learning One-Hidden-Layer Neural Networks via Provable Gradient Descent with Random Initialization

Although deep learning has demonstrated powerful performance in many applications, the mathematical principles behind neural networks remain poorly understood. In this paper, we consider the problem of learning a one-hidden-layer neural network with quadratic activations. We focus on the under-parameterized regime, where the number of hidden units is smaller than the dimension of the inputs. We propose to solve the problem via a provable gradient-based method with random initialization. For this non-convex neural network training problem, we show that the gradient descent iterates enter a local region that enjoys strong convexity and smoothness within a few iterations, and then provably converge to a globally optimal model at a linear rate with near-optimal sample complexity. We further corroborate our theoretical findings via various experiments.
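As a concrete illustration of the setup described above, the following is a minimal sketch (not the authors' exact algorithm or constants) of gradient descent with small random initialization for a one-hidden-layer network with quadratic activations, f(x; W) = ||W^T x||^2, in the under-parameterized regime where the number of hidden units k is smaller than the input dimension d. The problem sizes, step size, initialization scale, and iteration count below are illustrative assumptions.

```python
# Sketch: gradient descent with small random initialization for a
# one-hidden-layer network with quadratic activations,
#   f(x; W) = sum_j (w_j^T x)^2 = ||W^T x||^2,   W in R^{d x k},  k < d.
# All constants (d, k, n, eta, init scale, iteration count) are illustrative.
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 50, 5, 2000                               # input dim, hidden units (k < d), samples
W_star = rng.standard_normal((d, k)) / np.sqrt(d)   # ground-truth weights
X = rng.standard_normal((n, d))                     # Gaussian inputs
y = np.sum((X @ W_star) ** 2, axis=1)               # labels from the quadratic-activation network

W = 1e-3 * rng.standard_normal((d, k))              # small random initialization
eta = 0.02                                          # step size (assumed)

for t in range(3000):
    z = X @ W                                       # hidden pre-activations, shape (n, k)
    residual = np.sum(z ** 2, axis=1) - y           # f(x_i; W) - y_i
    # gradient of the least-squares loss (1 / 2n) * sum_i residual_i^2 w.r.t. W
    grad = (X.T @ (residual[:, None] * z)) * (2.0 / n)
    W -= eta * grad
    if t % 500 == 0:
        print(f"iter {t:4d}  loss {0.5 * np.mean(residual ** 2):.3e}")
```

Because the quadratic activation makes the model depend on W only through WW^T, the weights can at best be recovered up to an orthogonal transformation of the hidden units; the training loss itself converges toward zero.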
