Tight Hardness Results for Training Depth-2 ReLU Networks

We prove several hardness results for training depth-2 neural networks with the ReLU activation function; these networks are simply weighted sums (possibly with negative coefficients) of ReLUs. Our goal is to output a depth-2 neural network that minimizes the square loss with respect to a given training set. We prove that this problem is NP-hard already for a network with a single ReLU. We also prove NP-hardness for outputting a weighted sum of $k$ ReLUs minimizing the squared error (for $k>1$), even in the realizable setting (i.e., when the labels are consistent with an unknown depth-2 ReLU network). We are also able to obtain lower bounds on the running time in terms of the desired additive error $\epsilon$. To obtain our lower bounds, we use the Gap Exponential Time Hypothesis (Gap-ETH) as well as a new hypothesis regarding the hardness of approximating the well-known Densest $\kappa$-Subgraph problem in subexponential time (these hypotheses are used separately in proving different lower bounds). For example, we prove that under reasonable hardness assumptions, any proper learning algorithm for finding the best-fitting ReLU must run in time exponential in $1/\epsilon^2$. Together with previous work on improperly learning a ReLU (Goel et al., COLT '17), this implies the first separation between proper and improper algorithms for learning a ReLU. We also study the problem of properly learning a depth-2 network of ReLUs with bounded weights, giving new (worst-case) upper bounds on the running time needed to learn such networks in both the realizable and agnostic settings. Our upper bounds on the running time essentially match our lower bounds in terms of the dependency on $\epsilon$.
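
Concretely, given a training set $\{(x_i, y_i)\}_{i=1}^{m} \subset \mathbb{R}^d \times \mathbb{R}$, the training problem above can be written as the following optimization (a sketch in notation introduced here for illustration: $a_j$ denotes the output-layer coefficients and $w_j$ the hidden-layer weight vectors):

$$\min_{a_1,\dots,a_k \in \mathbb{R},\ w_1,\dots,w_k \in \mathbb{R}^d} \ \sum_{i=1}^{m} \Big( \sum_{j=1}^{k} a_j \max\big(0, \langle w_j, x_i \rangle\big) - y_i \Big)^2 .$$

The single-ReLU hardness result corresponds to $k=1$, and the lower bounds in terms of $\epsilon$ concern algorithms that output a network whose loss is within additive error $\epsilon$ of this minimum.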

[1] Madhur Tulsiani. CSP gaps and reductions in the Lasserre hierarchy, 2009, STOC '09.

[2] Prasad Raghavendra, et al. A Birthday Repetition Theorem and Complexity of Approximating Dense CSPs, 2016, ICALP.

[3] M. Talagrand, et al. Probability in Banach Spaces: Isoperimetry and Processes, 1991.

[4] David P. Woodruff, et al. Learning Two Layer Rectified Neural Networks in Polynomial Time, 2018, COLT.

[5] Guanghui Lan, et al. Complexity of Training ReLU Neural Networks, 2018.

[6] Aditya Bhaskara, et al. Polynomial integrality gaps for strong SDP relaxations of Densest k-subgraph, 2011, SODA.

[7] Benny Applebaum, et al. On Basing Lower-Bounds for Learning on Worst-Case Assumptions, 2008, 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS).

[8] Kurt Hornik, et al. Multilayer feedforward networks are universal approximators, 1989, Neural Networks.

[9] Santosh S. Vempala, et al. Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks, 2018, arXiv.

[10] J. Stephen Judd. On the complexity of loading shallow neural networks, 1988, J. Complex.

[11] Michael Alekhnovich, et al. Minimum propositional proof length is NP-hard to linearly approximate, 1998, Journal of Symbolic Logic.

[12] Erez Petrank. The hardness of approximation: Gap location, 2005, Computational Complexity.

[13] Pasin Manurangsi. Almost-polynomial ratio ETH-hardness of approximating densest k-subgraph, 2016, STOC.

[14] R. Schapire, et al. Toward efficient agnostic learning, 1992, COLT '92.

[15] Van H. Vu. On the Infeasibility of Training Neural Networks with Small Mean-Squared Error, 1998, IEEE Trans. Inf. Theory.

[16] Guy Kindler, et al. Polynomially Low Error PCPs with polyloglog n Queries via Modular Composition, 2015, STOC.

[17] Kane, et al. Beyond the Worst-Case Analysis of Algorithms, 2020.

[18] David Haussler. Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications, 1992, Inf. Comput.

[19] Pasin Manurangsi. On approximating projection games, 2015.

[20] Sébastien Bubeck. Convex Optimization: Algorithms and Complexity, 2014, Found. Trends Mach. Learn.

[21] Luca Trevisan, et al. From Gap-ETH to FPT-Inapproximability: Clique, Dominating Set, and More, 2017, IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS).

[22] Ambuj Tewari, et al. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization, 2008, NIPS.

[23] Vivek Srikumar, et al. Expressiveness of Rectifier Networks, 2015, ICML.

[24] Liwei Wang, et al. Gradient Descent Finds Global Minima of Deep Neural Networks, 2018, ICML.

[25] Sanjeev Arora, et al. Inapproximability of Densest κ-Subgraph from Average Case Hardness, 2011.

[26] George Cybenko. Approximation by superpositions of a sigmoidal function, 1989, Math. Control. Signals Syst.

[27] Irit Dinur, et al. On the hardness of approximating label-cover, 2004, Inf. Process. Lett.

[28] Amir Globerson, et al. Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs, 2017, ICML.

[29] Daniel M. Kane, et al. Nearly Tight Bounds for Robust Proper Learning of Halfspaces with a Margin, 2019, NeurIPS.

[30] Nimrod Megiddo. On the complexity of polyhedral separability, 1988, Discret. Comput. Geom.

[31] Fahad Panolan, et al. Refined Complexity of PCA with Outliers, 2019, ICML.

[32] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[33] Peter L. Bartlett, et al. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results, 2003, J. Mach. Learn. Res.

[34] Yao Xie, et al. ReLU Regression: Complexity, Exact and Approximation Algorithms, 2018, arXiv:1810.03592.

[35] Yuanzhi Li, et al. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers, 2018, NeurIPS.

[36] Varun Kanade, et al. Reliably Learning the ReLU in Polynomial Time, 2016, COLT.

[37] Ambuj Tewari, et al. Smoothness, Low Noise and Fast Rates, 2010, NIPS.

[38] Richard M. Karp. Reducibility Among Combinatorial Problems, 1972, 50 Years of Integer Programming.

[39] Carsten Lund, et al. On the hardness of approximating minimization problems, 1994, JACM.

[40] Grant Schoenebeck. Linear Level Lasserre Lower Bounds for Certain k-CSPs, 2008, 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS).

[41] Ryan O'Donnell, et al. How to Refute a Random CSP, 2015, IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS).

[42] Aravindan Vijayaraghavan, et al. Approximation Algorithms for Label Cover and The Log-Density Threshold, 2017, SODA.

[43] Roi Livni, et al. On the Computational Efficiency of Training Neural Networks, 2014, NIPS.

[44] Adam R. Klivans, et al. Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals, 2019, NeurIPS.

[45] Rocco A. Servedio, et al. What Circuit Classes Can Be Learned with Non-Trivial Savings?, 2017, ITCS.

[46] Russell Impagliazzo, et al. On the Complexity of k-SAT, 2001, J. Comput. Syst. Sci.

[47] Francis R. Bach. Breaking the Curse of Dimensionality with Convex Neural Networks, 2014, J. Mach. Learn. Res.

[48] Le Song, et al. On the Complexity of Learning Neural Networks, 2017, NIPS.

[49] Raman Arora, et al. Understanding Deep Neural Networks with Rectified Linear Units, 2016, Electron. Colloquium Comput. Complex.

[50] Guanghui Lan, et al. Complexity of Training ReLU Neural Network, 2018, Discret. Optim.

[51] Larry Stockmeyer. Planar 3-colorability is polynomial complete, 1973, SIGACT News.

[52] Irit Dinur, et al. Mildly exponential reduction from gap 3SAT to polynomial-gap label-cover, 2016, Electron. Colloquium Comput. Complex.

[53] Ronald L. Rivest, et al. Training a 3-node neural network is NP-complete, 1988, COLT '88.

[54] Yao Xie, et al. An Approximation Algorithm for Training One-Node ReLU Neural Network, 2018.

[55] Russell Impagliazzo, et al. Which Problems Have Strongly Exponential Complexity?, 2001, J. Comput. Syst. Sci.

[56] Pasin Manurangsi, et al. The Computational Complexity of Training ReLU(s), 2018, arXiv.