Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods

Authors: Majid Janzamin, Hanie Sedghi, Anima Anandkumar

Abstract: Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for the guaranteed training of two-layer neural networks. We provide risk bounds for the proposed method, with sample complexity polynomial in the relevant parameters, such as the input dimension and the number of neurons. While learning arbitrary target functions is NP-hard, we give transparent conditions on the target function and the input distribution under which learning is tractable. Our training method is based on tensor decomposition, which provably converges to the global optimum under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multilinear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. Thus, we obtain a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer.
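To make the flavor of the method concrete, below is a minimal, illustrative sketch of the core pipeline the abstract describes: form a cross-moment tensor between the label and a score function of the input, then decompose it to recover rank-1 components aligned with the hidden-layer weight directions. The sketch assumes standard Gaussian inputs (so the third-order score function has a closed form as the third Hermite tensor) and uses a basic power iteration with deflation; the paper's actual procedure additionally involves whitening, bias estimation, and a regression step for the output layer, none of which are shown here. Function names such as `score_3_gaussian` and `tensor_power_method` are my own and not from the paper.

```python
import numpy as np

def score_3_gaussian(x):
    """Third-order score function S_3(x) of a standard Gaussian input
    (the third Hermite tensor). Returns an array of shape (d, d, d).
    This closed form is an assumption tied to the Gaussian-input case."""
    d = x.shape[0]
    I = np.eye(d)
    T = np.einsum('i,j,k->ijk', x, x, x)
    T -= np.einsum('i,jk->ijk', x, I)
    T -= np.einsum('j,ik->ijk', x, I)
    T -= np.einsum('k,ij->ijk', x, I)
    return T

def empirical_cross_moment(X, y):
    """Empirical cross-moment tensor T_hat = (1/n) * sum_i y_i * S_3(x_i)."""
    n, d = X.shape
    T = np.zeros((d, d, d))
    for xi, yi in zip(X, y):
        T += yi * score_3_gaussian(xi)
    return T / n

def tensor_power_method(T, k, n_restarts=10, n_iters=100, seed=0):
    """Extract k approximately rank-1 components of a symmetric tensor
    by power iteration with deflation (a simple stand-in for the
    guaranteed decomposition routines cited in the paper)."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    T = T.copy()
    weights, vectors = [], []
    for _ in range(k):
        best_lam, best_v = -np.inf, None
        for _ in range(n_restarts):
            v = rng.standard_normal(d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijk,j,k->i', T, v, v)  # tensor-vector contraction
                v /= np.linalg.norm(v)
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        weights.append(best_lam)
        vectors.append(best_v)
        # Deflate: subtract the recovered rank-1 component before the next pass.
        T -= best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return np.array(weights), np.stack(vectors, axis=1)
```

As a usage sketch, given data `X` of shape `(n, d)` and labels `y`, one would call `empirical_cross_moment(X, y)` followed by `tensor_power_method(T, k)` to obtain candidate hidden-unit weight directions; the per-sample tensor accumulation and the restarts of the power iteration are the "embarrassingly parallel" linear and multilinear operations the abstract refers to.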
