Generalization Bounds for Neural Networks through Tensor Factorization

Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm, based on tensor decomposition, for training a two-layer neural network. We prove generalization bounds for the proposed method, with sample complexity polynomial in the relevant parameters, such as the input dimension and the number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the target function and the input distribution under which generalization is achievable. The tensor decomposition at the core of our method provably converges to the global optimum under a set of mild non-degeneracy conditions. The method consists of simple, embarrassingly parallel linear and multilinear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. Thus, for the first time, we obtain a computationally efficient method for training neural networks with guaranteed generalization bounds.
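As a rough illustration of the tensor-based training pipeline described above, the sketch below forms an empirical cross-moment tensor between the label and the third-order score function of the input, and then decomposes it with a simple power-iteration-plus-deflation routine to recover the first-layer weight directions. It assumes standard Gaussian inputs (so the score function has a closed form) and uses illustrative names such as cross_moment_tensor and tensor_power_iteration that are not from the paper; the actual algorithm involves additional whitening and robustness steps, and a separate procedure for the remaining parameters.

```python
# Minimal sketch, assuming x ~ N(0, I_d) so the third-order score function
# has the closed form S3(x) = x⊗x⊗x minus the symmetrized x⊗I terms.
# All names and simplifications here are illustrative, not from the paper.
import numpy as np

def score_tensor_gaussian(x):
    """Third-order score function of a standard Gaussian input."""
    d = x.shape[0]
    I = np.eye(d)
    T = np.einsum('i,j,k->ijk', x, x, x)
    T -= np.einsum('i,jk->ijk', x, I)
    T -= np.einsum('j,ik->ijk', x, I)
    T -= np.einsum('k,ij->ijk', x, I)
    return T

def cross_moment_tensor(X, y):
    """Empirical estimate of E[y * S3(x)]; by a Stein-type identity this is
    (approximately) a weighted sum of rank-1 terms a_j ⊗ a_j ⊗ a_j over the
    hidden units, where a_j are the first-layer weight vectors."""
    d = X.shape[1]
    T = np.zeros((d, d, d))
    for x_i, y_i in zip(X, y):
        T += y_i * score_tensor_gaussian(x_i)
    return T / len(y)

def tensor_power_iteration(T, k, n_iter=100, n_restarts=10, seed=None):
    """Extract k approximate rank-1 components of a symmetric 3rd-order tensor
    by power iteration with deflation (a basic variant; provable guarantees
    require a more careful whitened/robust version)."""
    rng = np.random.default_rng(seed)
    d = T.shape[0]
    vecs, weights = [], []
    T_res = T.copy()
    for _ in range(k):
        best_v, best_lam = None, -np.inf
        for _ in range(n_restarts):
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            for _ in range(n_iter):
                v = np.einsum('ijk,j,k->i', T_res, v, v)
                v /= np.linalg.norm(v) + 1e-12
            lam = np.einsum('ijk,i,j,k->', T_res, v, v, v)
            if lam > best_lam:
                best_v, best_lam = v, lam
        vecs.append(best_v)
        weights.append(best_lam)
        T_res -= best_lam * np.einsum('i,j,k->ijk', best_v, best_v, best_v)
    return np.array(vecs), np.array(weights)
```

Once the first-layer directions are recovered up to sign and scale, the remaining parameters reduce to low-dimensional estimation problems; a simple stand-in is to regress the labels on the resulting hidden activations to obtain the second-layer weights, although the paper's own procedure for those parameters differs.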
