On the Connection Between Learning Two-Layers Neural Networks and Tensor Decomposition

We establish connections between the problem of learning a two-layers neural network with good generalization error and tensor decomposition. We consider a model with input $\boldsymbol x \in \mathbb R^d$, $r$ hidden units with weights $\{\boldsymbol w_i\}_{1\le i \le r}$, and output $y\in \mathbb R$, i.e., $y=\sum_{i=1}^r \sigma(\left \langle \boldsymbol x, \boldsymbol w_i\right \rangle)$, where $\sigma$ denotes the activation function. First, we show that if we cannot learn the weights $\{\boldsymbol w_i\}_{1\le i \le r}$ accurately, then the neural network does not generalize well. More specifically, its generalization error is close to that of a trivial predictor with access only to the norm of the input. This result holds for any activation function and requires that the weights be roughly isotropic and the input distribution be Gaussian, both typical assumptions in the theoretical literature. Then, we show that the problem of learning the weights $\{\boldsymbol w_i\}_{1\le i \le r}$ is at least as hard as tensor decomposition. This result holds for any input distribution and assumes that the activation function is a polynomial whose degree is related to the order of the tensor to be decomposed. Putting everything together, we prove that learning a two-layers neural network that generalizes well is at least as hard as tensor decomposition. It has been observed that neural network models with more parameters than training samples often generalize well, even though the problem is highly underdetermined. This means that the learning algorithm does not estimate the weights accurately, yet it achieves a good generalization error. This paper shows that such a phenomenon cannot occur when the input distribution is Gaussian and the weights are roughly isotropic. We also provide numerical evidence supporting our theoretical findings.
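
To make the connection concrete, the following is a minimal numerical sketch, not the reduction proved in the paper, and with all dimensions, sample sizes, and variable names chosen purely for illustration. For the cubic activation $\sigma(t)=t^3$ and Gaussian input $\boldsymbol x \sim \mathsf N(0, \boldsymbol I_d)$, one can check (e.g., via Wick's formula) that $\mathbb E[y \, \boldsymbol H_3(\boldsymbol x)] = 6\sum_{i=1}^r \boldsymbol w_i^{\otimes 3}$, where $\boldsymbol H_3$ is the third Hermite tensor. Hence estimating the weights from data amounts to decomposing an empirical third-order tensor, which is the moment-based link between the two problems.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 6, 3, 200_000   # input dimension, hidden units, samples (illustrative choices)

# Planted weights w_1, ..., w_r (rows of W), normalized for convenience.
W = rng.standard_normal((r, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Data from the model y = sum_i sigma(<x, w_i>) with the cubic activation sigma(t) = t^3.
X = rng.standard_normal((n, d))          # Gaussian inputs x ~ N(0, I_d)
Y = ((X @ W.T) ** 3).sum(axis=1)         # noiseless outputs

# Empirical third Hermite moment  T_hat = E_hat[ y * H_3(x) ],  where
# H_3(x)_{abc} = x_a x_b x_c - (x_a delta_{bc} + x_b delta_{ac} + x_c delta_{ab}).
I = np.eye(d)
yx = (Y[:, None] * X).mean(axis=0)                                   # E_hat[y x]
yxxx = np.einsum('n,na,nb,nc->abc', Y, X, X, X, optimize=True) / n   # E_hat[y x(x)x(x)x]
T_hat = yxxx - (np.einsum('a,bc->abc', yx, I)
                + np.einsum('b,ac->abc', yx, I)
                + np.einsum('c,ab->abc', yx, I))

# Population identity for this activation: E[y * H_3(x)] = 6 * sum_i w_i (x) w_i (x) w_i,
# so recovering the weights (up to permutation of the hidden units) from T_hat is a
# rank-r symmetric tensor decomposition.
T_pop = 6 * np.einsum('ia,ib,ic->abc', W, W, W)
rel_err = np.linalg.norm(T_hat - T_pop) / np.linalg.norm(T_pop)
print(f"relative error between empirical and population tensor: {rel_err:.3f}")
```

The relative error decreases as the sample size $n$ grows, and a symmetric tensor decomposition routine applied to this tensor returns estimates of the $\boldsymbol w_i$ up to permutation of the hidden units. The cubic case is only an illustration: the paper handles general polynomial activations, with the degree of the polynomial determining the order of the tensor to be decomposed.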
