Learning Two Layer Rectified Neural Networks in Polynomial Time

Consider the following fundamental learning problem: given input examples $x \in \mathbb{R}^d$ and their vector-valued labels, as defined by an underlying generative neural network, recover the weight matrices of this network. We consider two-layer networks, mapping $\mathbb{R}^d$ to $\mathbb{R}^m$, with $k$ non-linear activation units $f(\cdot)$, where $f(x) = \max \{x , 0\}$ is the ReLU. Such a network is specified by two weight matrices, $\mathbf{U}^* \in \mathbb{R}^{m \times k}, \mathbf{V}^* \in \mathbb{R}^{k \times d}$, such that the label of an example $x \in \mathbb{R}^{d}$ is given by $\mathbf{U}^* f(\mathbf{V}^* x)$, where $f(\cdot)$ is applied coordinate-wise. Given $n$ samples as a matrix $\mathbf{X} \in \mathbb{R}^{d \times n}$ and the (possibly noisy) labels $\mathbf{U}^* f(\mathbf{V}^* \mathbf{X}) + \mathbf{E}$ of the network on these samples, where $\mathbf{E}$ is a noise matrix, our goal is to recover the weight matrices $\mathbf{U}^*$ and $\mathbf{V}^*$. In this work, we develop algorithms and hardness results under varying assumptions on the input and noise. Although the problem is NP-hard even for $k=2$, by assuming Gaussian marginals over the input $\mathbf{X}$ we are able to develop polynomial time algorithms for the approximate recovery of $\mathbf{U}^*$ and $\mathbf{V}^*$. Perhaps surprisingly, in the noiseless case our algorithms recover $\mathbf{U}^*,\mathbf{V}^*$ exactly, i.e., with no error. To the best of our knowledge, these are the first algorithms to accomplish exact recovery. For the noisy case, we give the first polynomial time algorithm that approximately recovers the weights in the presence of mean-zero noise $\mathbf{E}$. Our algorithms generalize to a larger class of rectified activation functions, where $f(x) = 0$ for $x \leq 0$ and $f(x) > 0$ otherwise.
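To make the generative model concrete, the following Python/NumPy sketch samples an instance of the problem under the Gaussian-input assumption: ground-truth weights $\mathbf{U}^*, \mathbf{V}^*$, Gaussian examples $\mathbf{X}$, and labels $\mathbf{U}^* f(\mathbf{V}^* \mathbf{X}) + \mathbf{E}$. The dimensions, noise level, and Gaussian choice of $\mathbf{U}^*, \mathbf{V}^*$ are illustrative assumptions, not parameters fixed by the paper.

```python
import numpy as np

def relu(z):
    # coordinate-wise rectification f(x) = max{x, 0}
    return np.maximum(z, 0.0)

def generate_instance(d=50, k=10, m=20, n=10000, noise_std=0.0, seed=0):
    """Sample a random two-layer ReLU network and n Gaussian examples.

    Returns (X, A, U_star, V_star) with X in R^{d x n} and observed labels
    A = U* f(V* X) + E, where E has i.i.d. N(0, noise_std^2) entries
    (noise_std = 0 gives the noiseless, exact-recovery setting).
    """
    rng = np.random.default_rng(seed)
    U_star = rng.standard_normal((m, k))   # output-layer weights U* in R^{m x k}
    V_star = rng.standard_normal((k, d))   # hidden-layer weights V* in R^{k x d}
    X = rng.standard_normal((d, n))        # Gaussian marginals over the input
    E = noise_std * rng.standard_normal((m, n))
    A = U_star @ relu(V_star @ X) + E      # (possibly noisy) labels
    return X, A, U_star, V_star

if __name__ == "__main__":
    X, A, U_star, V_star = generate_instance(noise_std=0.1)
    print(X.shape, A.shape)  # (50, 10000) (20, 10000)
```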
