Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks

We analyze Gradient Descent (GD) applied to learning a bounded target function on $n$ real-valued inputs by training a neural network with a single hidden layer of nonlinear gates. Our main finding is that GD, starting from a randomly initialized network, converges in mean squared loss to the error (in 2-norm) of the best approximation of the target function by a polynomial of degree at most $k$. Moreover, the size of the network and the number of iterations needed are both bounded by $n^{O(k)}$. The core of our analysis is the following existence theorem, which is of independent interest: for any $\epsilon > 0$, any bounded function that has a degree-$k$ polynomial approximation with error $\epsilon_0$ (in 2-norm) can be approximated to within error $\epsilon_0 + \epsilon$ as a linear combination of $n^{O(k)} \cdot \mathrm{poly}(1/\epsilon)$ randomly chosen gates from any class of gates whose corresponding activation function has nonzero coefficients in its harmonic expansion for all degrees up to $k$. In particular, this applies to training networks of unbiased sigmoids and ReLUs.
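As a rough illustration of the setting (not the paper's exact algorithm, parameters, or guarantees), the sketch below trains only the outer linear layer of a one-hidden-layer ReLU network whose unit-norm hidden weights are drawn at random and kept fixed, using plain gradient descent on the squared loss. This mirrors the "linear combination of randomly chosen gates" view from the existence theorem; all sizes, the step-size rule, and the target function are illustrative choices.

```python
import numpy as np

# Minimal sketch (assumptions: inputs on the unit sphere, squared loss,
# random fixed hidden ReLU units, only the outer linear layer trained).

rng = np.random.default_rng(0)

n, m, N = 10, 500, 2000   # input dimension, hidden width, sample count (illustrative)
steps = 2000              # number of gradient descent iterations (illustrative)

def unit_sphere(num, dim):
    """Sample points uniformly from the unit sphere in R^dim."""
    x = rng.standard_normal((num, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

X = unit_sphere(N, n)
# Illustrative bounded target: a degree-2 polynomial of the inputs.
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2

W = unit_sphere(m, n)              # random, unbiased hidden units (fixed throughout)
H = np.maximum(X @ W.T, 0.0)       # ReLU activations of the random gates
a = np.zeros(m)                    # trainable outer-layer weights

G = H.T @ H / N                    # empirical feature Gram matrix
lr = 1.0 / np.linalg.eigvalsh(G)[-1]  # step size below the curvature bound of the loss

for _ in range(steps):
    residual = H @ a - y           # current prediction error on the sample
    grad = H.T @ residual / N      # gradient of the mean squared loss in a
    a -= lr * grad                 # plain gradient descent step

print("final mean squared error:", np.mean((H @ a - y) ** 2))
```

Because only the outer weights move, the loss is a convex quadratic in the trained parameters, so this sketch converges to the best fit achievable by this particular draw of random gates; the abstract's claim is the stronger statement that, with $n^{O(k)}$ random gates, that best fit is already close to the best degree-$k$ polynomial approximation.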
