Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks

We analyze Gradient Descent (GD) applied to learning a bounded target function on $n$ real-valued inputs by training a neural network with a single hidden layer of nonlinear gates. Our main finding is that GD, starting from a randomly initialized network, converges in mean squared loss to the error (in 2-norm) of the best approximation of the target function by a polynomial of degree at most $k$. Moreover, the size of the network and the number of iterations needed are both bounded by $n^{O(k)}$. The core of our analysis is the following existence theorem, which is of independent interest: for any $\epsilon > 0$, any bounded function that has a degree-$k$ polynomial approximation with error $\epsilon_0$ (in 2-norm) can be approximated to within error $\epsilon_0 + \epsilon$ as a linear combination of $n^{O(k)} \cdot \mathrm{poly}(1/\epsilon)$ randomly chosen gates from any class of gates whose corresponding activation function has nonzero coefficients in its harmonic expansion for all degrees up to $k$. In particular, this applies to training networks of unbiased sigmoids and ReLUs.
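As a rough illustration of the setting (not the paper's exact algorithm, parameters, or guarantees), the sketch below trains only the outer linear layer of a one-hidden-layer ReLU network whose unit-norm hidden weights are drawn at random and kept fixed, using plain gradient descent on the squared loss. This mirrors the "linear combination of randomly chosen gates" view from the existence theorem; all sizes, the step-size rule, and the target function are illustrative choices.

```python
import numpy as np

# Minimal sketch (assumptions: inputs on the unit sphere, squared loss,
# random fixed hidden ReLU units, only the outer linear layer trained).

rng = np.random.default_rng(0)

n, m, N = 10, 500, 2000   # input dimension, hidden width, sample count (illustrative)
steps = 2000              # number of gradient descent iterations (illustrative)

def unit_sphere(num, dim):
    """Sample points uniformly from the unit sphere in R^dim."""
    x = rng.standard_normal((num, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

X = unit_sphere(N, n)
# Illustrative bounded target: a degree-2 polynomial of the inputs.
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2

W = unit_sphere(m, n)              # random, unbiased hidden units (fixed throughout)
H = np.maximum(X @ W.T, 0.0)       # ReLU activations of the random gates
a = np.zeros(m)                    # trainable outer-layer weights

G = H.T @ H / N                    # empirical feature Gram matrix
lr = 1.0 / np.linalg.eigvalsh(G)[-1]  # step size below the curvature bound of the loss

for _ in range(steps):
    residual = H @ a - y           # current prediction error on the sample
    grad = H.T @ residual / N      # gradient of the mean squared loss in a
    a -= lr * grad                 # plain gradient descent step

print("final mean squared error:", np.mean((H @ a - y) ** 2))
```

Because only the outer weights move, the loss is a convex quadratic in the trained parameters, so this sketch converges to the best fit achievable by this particular draw of random gates; the abstract's claim is the stronger statement that, with $n^{O(k)}$ random gates, that best fit is already close to the best degree-$k$ polynomial approximation.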
