Efficient Algorithms for Learning Depth-2 Neural Networks with General ReLU Activations

We present polynomial-time and sample-efficient algorithms for learning an unknown depth-2 feedforward neural network with general ReLU activations, under mild non-degeneracy assumptions. In particular, we consider learning an unknown network of the form $f(x) = {a}^{\mathsf{T}}\sigma({W}^{\mathsf{T}}x+b)$, where $x$ is drawn from the Gaussian distribution and $\sigma(t) := \max(t,0)$ is the ReLU activation. Prior works on learning networks with ReLU activations assume that the bias $b$ is zero. To handle the presence of the bias terms, our algorithm robustly decomposes multiple higher-order tensors arising from the Hermite expansion of the function $f(x)$. Using these ideas, we also establish identifiability of the network parameters under minimal assumptions.
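The abstract describes recovering the hidden units by decomposing higher-order tensors obtained from the Hermite expansion of $f$. As a rough illustration only (not the paper's actual algorithm), the following numpy sketch estimates the order-3 Hermite moment tensor $\mathbb{E}[f(x)\,\mathrm{He}_3(x)]$ from Gaussian samples of a small synthetic network; the dimensions, sample size, and variable names are illustrative assumptions, and the subsequent robust tensor-decomposition step is omitted.

```python
# Minimal sketch (not the authors' implementation): estimate the order-3
# Hermite moment tensor of a depth-2 ReLU network y = a^T relu(W^T x + b)
# from Gaussian samples. All sizes below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

d, k, n = 10, 3, 100_000           # input dimension, hidden width, sample size
W = rng.normal(size=(d, k))        # hidden-layer weights (columns w_i)
b = rng.normal(size=k)             # general (nonzero) bias terms
a = rng.normal(size=k)             # output-layer weights

X = rng.normal(size=(n, d))                 # x ~ N(0, I_d)
y = np.maximum(X @ W + b, 0.0) @ a          # labels f(x) = a^T relu(W^T x + b)

# Empirical order-3 Hermite moment tensor:
# He_3(x)_{ijk} = x_i x_j x_k - x_i d_{jk} - x_j d_{ik} - x_k d_{ij}.
m1 = (y[:, None] * X).mean(axis=0)                       # E[y x]
T3 = np.einsum('n,ni,nj,nk->ijk', y, X, X, X) / n        # E[y x (x) x (x) x]
I = np.eye(d)
T3 -= (np.einsum('i,jk->ijk', m1, I)
       + np.einsum('j,ik->ijk', m1, I)
       + np.einsum('k,ij->ijk', m1, I))

# For Gaussian inputs, T3 is approximately a weighted sum of rank-1 terms
# c_i * (w_i/||w_i||)^{(x)3}, with coefficients c_i determined by a_i,
# ||w_i||, and b_i. Decomposing T3 (e.g. via Jennrich's algorithm, not
# shown here) would recover the hidden-unit directions w_i/||w_i||.
W_unit = W / np.linalg.norm(W, axis=0)
for i in range(k):
    u = W_unit[:, i]
    val = np.einsum('ijk,i,j,k->', T3, u, u, u)
    print(f"unit {i}: <T3, u_i^3> = {val:.3f}")
```

In the full pipeline sketched by the abstract, such a tensor (together with moment tensors of other orders) would be fed to a robust decomposition routine to recover the directions $w_i/\|w_i\|$, after which the norms, biases, and output weights can be fit; the paper's guarantees concern that complete procedure, not this toy estimate.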
