Statistical-Query Lower Bounds via Functional Gradients

We give the first statistical-query lower bounds for agnostically learning any non-polynomial activation with respect to Gaussian marginals (e.g., ReLU, sigmoid, sign). For the specific problem of ReLU regression (equivalently, agnostically learning a ReLU), we show that any statistical-query algorithm with tolerance $n^{-\Theta(\epsilon^{-1/2})}$ must use at least $2^{n^c} \epsilon$ queries for some constant $c > 0$, where $n$ is the dimension and $\epsilon$ is the accuracy parameter. Our results rule out general (as opposed to correlational) SQ learning algorithms, which is unusual for real-valued learning problems. Our techniques involve a gradient boosting procedure for "amplifying" recent lower bounds due to Diakonikolas et al. (COLT 2020) and Goel et al. (ICML 2020) on the SQ dimension of functions computed by two-layer neural networks. The crucial new ingredient is the use of a nonstandard convex functional during the boosting procedure. This also yields a best-possible reduction between two commonly studied models of learning: agnostic learning and probabilistic concepts.
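
The abstract's central tool is boosting viewed as gradient descent in function space. For orientation only, here is a minimal sketch of classical functional-gradient boosting for the squared loss, in the spirit of Friedman [30] and Bartlett et al. [21]; the paper's amplification argument instead uses a nonstandard convex functional and applies it to SQ-hard base functions, neither of which this sketch implements. The helper names (`fit_stump`, `boost`) and the toy ReLU data are illustrative assumptions, not part of the paper.

```python
import numpy as np


def fit_stump(x, r):
    """Fit a depth-1 regression stump (axis-aligned threshold) to the residual vector r."""
    best = None
    for j in range(x.shape[1]):
        for t in np.unique(x[:, j]):
            left = x[:, j] <= t
            if left.all() or not left.any():
                continue
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = float(((r - pred) ** 2).sum())
            if best is None or err < best[0]:
                best = (err, j, t, float(r[left].mean()), float(r[~left].mean()))
    _, j, t, a, b = best
    return lambda z, j=j, t=t, a=a, b=b: np.where(z[:, j] <= t, a, b)


def boost(x, y, rounds=50, eta=0.3):
    """Greedy functional-gradient descent on F(h) = E[(h(x) - y)^2]:
    each round fits a weak learner to the negative functional gradient of F at
    the current predictor, which for squared loss is simply the residual y - h(x)."""
    f = np.zeros_like(y)
    ensemble = []
    for _ in range(rounds):
        residual = y - f             # negative functional gradient of the squared loss
        g = fit_stump(x, residual)   # weak hypothesis approximating the residual
        f = f + eta * g(x)
        ensemble.append(g)
    return lambda z: eta * sum(g(z) for g in ensemble)


# Usage: noisy data labeled by a ReLU of one coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.maximum(X[:, 0], 0.0) + 0.1 * rng.normal(size=200)
h = boost(X, y)
print("training MSE:", float(np.mean((h(X) - y) ** 2)))
```

The only boosting-specific choice above is that the weak learner is fit to the negative functional gradient (here the residual); replacing the squared loss with a different convex functional, as the paper does, changes only how that gradient is computed.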

[1] Ilias Diakonikolas et al. Approximation Schemes for ReLU Regression. COLT, 2020.

[2] Vitaly Feldman et al. A Complete Characterization of Statistical Query Learning with Applications to Evolvability. FOCS, 2009.

[3] Jeffrey C. Jackson et al. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. FOCS, 1994.

[4] Elad Hazan et al. Introduction to Online Convex Optimization. Foundations and Trends in Optimization, 2016.

[5] John Wilmes et al. Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds. COLT, 2018.

[6] Alexandr Andoni et al. Attribute-efficient learning of monomials over highly-correlated variables. ALT, 2019.

[7] Adam R. Klivans et al. Time/Accuracy Tradeoffs for Learning a ReLU with respect to Gaussian Marginals. NeurIPS, 2019.

[8] Yuan Cao et al. Agnostic Learning of a Single Neuron with Gradient Descent. NeurIPS, 2020.

[9] Pravesh Kothari et al. Embedding Hard Learning Problems into Gaussian Space. Electronic Colloquium on Computational Complexity (ECCC), 2014.

[10] Roi Livni et al. On the Computational Efficiency of Training Neural Networks. NIPS, 2014.

[11] Abhishek Panigrahi et al. Effect of Activation Functions on the Training of Overparametrized Neural Nets. ICLR, 2019.

[12] Daniel M. Kane et al. Near-Optimal SQ Lower Bounds for Agnostically Learning Halfspaces and ReLUs under Gaussian Marginals. NeurIPS, 2020.

[13] Robert E. Schapire et al. Efficient distribution-free learning of probabilistic concepts. FOCS, 1990.

[14] Daniel M. Kane et al. Algorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer ReLU Networks. COLT, 2020.

[15] Haipeng Luo et al. Online Gradient Boosting. NIPS, 2015.

[16] Vitaly Feldman et al. Distribution-Specific Agnostic Boosting. ICS, 2009.

[17] Le Song et al. On the Complexity of Learning Neural Networks. NIPS, 2017.

[18] Gilad Yehudai et al. Learning a Single Neuron with Gradient Methods. COLT, 2020.

[19] C. Fonseca et al. Basic trigonometric power sums with applications. arXiv:1601.07839, 2016.

[20] Alexandr Andoni et al. Learning Sparse Polynomial Functions. SODA, 2014.

[21] Peter L. Bartlett et al. Boosting Algorithms as Gradient Descent. NIPS, 1999.

[22] Daniel M. Kane et al. Statistical Query Lower Bounds for Robust Estimation of High-Dimensional Gaussians and Gaussian Mixtures. FOCS, 2017.

[23] E. Hille et al. Contributions to the theory of Hermitian series. II. The representation problem. 1940.

[24] Gilad Yehudai et al. On the Power and Limitations of Random Features for Understanding Neural Networks. NeurIPS, 2019.

[25] Varun Kanade et al. Reliably Learning the ReLU in Polynomial Time. COLT, 2016.

[26] Rocco A. Servedio et al. Agnostically learning halfspaces. FOCS, 2005.

[27] Robert E. Schapire and Yoav Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.

[28] Martin Jaggi et al. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. ICML, 2013.

[29] Noam Nisan et al. Constant depth circuits, Fourier transform, and learnability. FOCS, 1989.

[30] J. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 2001.

[31] Daniel M. Kane et al. Bounded Independence Fools Degree-2 Threshold Functions. FOCS, 2010.

[32] Francis R. Bach et al. Breaking the Curse of Dimensionality with Convex Neural Networks. Journal of Machine Learning Research, 2014.

[33] John P. Boyd et al. Asymptotic coefficients of Hermite function series. 1984.

[34] Adam R. Klivans et al. Superpolynomial Lower Bounds for Learning One-Layer Neural Networks using Gradient Descent. ICML, 2020.

[35] Adam Tauman Kalai et al. Potential-Based Agnostic Boosting. NIPS, 2009.