Learning Deep ReLU Networks Is Fixed-Parameter Tractable

We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs and obtain the first nontrivial results for networks of depth more than two. We give an algorithm whose running time is a fixed polynomial in the ambient dimension and some (exponentially large) function of only the network's parameters. Our bounds depend on the number of hidden units, the depth, the spectral norm of the weight matrices, and the Lipschitz constant of the overall network (we show that some dependence on the Lipschitz constant is necessary). We also give a bound that is doubly exponential in the size of the network but is independent of the spectral norm. These results provably cannot be obtained using gradient-based methods and give the first example of a class of efficiently learnable neural networks that gradient descent will fail to learn. In contrast, prior work for learning networks of depth three or higher requires exponential time in the ambient dimension, even when the above parameters are bounded by a constant. Additionally, all prior work for the depth-two case requires well-conditioned weights and/or positive coefficients to obtain efficient running times; our algorithm requires neither assumption. Our main technical tool is a type of filtered PCA that can be used to iteratively recover an approximate basis for the subspace spanned by the hidden units in the first layer. Our analysis leverages new structural results on lattice polynomials from tropical geometry.
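To give a feel for why PCA-style moment estimates can expose the subspace spanned by the first-layer weights under Gaussian inputs, here is a minimal sketch on a toy one-hidden-layer network. It uses a plain Stein's-lemma-style second-moment estimator, not the paper's filtered-PCA procedure; the toy network, dimensions, and sample size are hypothetical choices for illustration, and this plain estimator can degenerate when the output-layer coefficients cancel, which is part of what the actual filtering is designed to handle.

```python
import numpy as np

# Toy setup (hypothetical): f(x) = sum_i a_i * relu(<w_i, x>), whose first-layer
# weight vectors w_1, ..., w_k span a k-dimensional subspace of R^d.
rng = np.random.default_rng(0)
d, k, n = 50, 3, 200_000            # ambient dimension, hidden units, Gaussian samples
W = rng.normal(size=(k, d))         # unknown first-layer weight vectors (rows)
a = rng.normal(size=k)              # unknown output-layer coefficients

def f(X):
    """Evaluate the toy network on a batch of inputs (rows of X)."""
    return np.maximum(X @ W.T, 0.0) @ a

# Moment-based subspace recovery under Gaussian inputs: by Stein's lemma,
# M = E[f(x) (x x^T - I)] is a weighted sum of the rank-one matrices w_i w_i^T,
# so (barring cancellations among those weights) the top-k eigenvectors of the
# empirical M approximately span the same subspace as w_1, ..., w_k.
X = rng.normal(size=(n, d))
y = f(X)
M = (X.T * y) @ X / n - y.mean() * np.eye(d)
M = (M + M.T) / 2                                # symmetrize against sampling noise
eigvals, eigvecs = np.linalg.eigh(M)
top = np.argsort(-np.abs(eigvals))[:k]           # eigenvalues may be negative: sort by magnitude
U_hat = eigvecs[:, top]                          # estimated orthonormal basis for the hidden subspace

# Sanity check: cosines of the principal angles between the estimate and the true span
# (values near 1 indicate the subspaces nearly coincide).
Q_true, _ = np.linalg.qr(W.T)
print(np.round(np.linalg.svd(U_hat.T @ Q_true, compute_uv=False), 3))
```

The sketch only illustrates the basic phenomenon; per the abstract, the paper's algorithm instead filters the samples and iterates to recover an approximate basis, and handles deep networks rather than this single hidden layer.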
