On the Power and Limitations of Random Features for Understanding Neural Networks

Recently, a spate of papers has provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics behave as if those components were essentially fixed at their initial random values. In fact, fixing these components explicitly leads to the well-known approach of learning with random features. In other words, these techniques imply that we can successfully learn with neural networks whenever we can successfully learn with random features. In this paper, we first review these techniques, providing a simple and self-contained analysis for one-hidden-layer networks. We then argue that despite the impressive positive results, random-feature approaches are inherently limited in what they can explain. In particular, we rigorously show that random features cannot be used to learn even a single ReLU neuron with standard Gaussian inputs, unless the network size (or the magnitude of the weights) is exponentially large. Since a single neuron is learnable with gradient-based methods, we conclude that we are still far from a satisfying general explanation for the empirical success of neural networks.

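To make the random-features view concrete, here is a minimal NumPy sketch (an illustration under assumed settings, not the paper's construction or experiments): the first-layer ReLU weights are drawn at random and frozen, only the linear output layer is trained (via ridge regression, used here as a convex stand-in for gradient descent on the output weights), and the target is a single ReLU neuron under standard Gaussian inputs. The dimensions, widths, and the regularization parameter `lam` are arbitrary illustrative choices.

```python
# Minimal sketch (not from the paper): a random-features predictor
# f(x) = sum_i u_i * relu(<w_i, x>) with frozen random w_i, fit to the
# labels of a single ReLU neuron x -> relu(<w*, x>) on Gaussian inputs.
import numpy as np

rng = np.random.default_rng(0)

d = 20                              # input dimension (illustrative)
n_train, n_test = 5_000, 2_000

def relu(z):
    return np.maximum(z, 0.0)

# Target: a single ReLU neuron with a unit-norm weight vector.
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)

def sample(n):
    X = rng.standard_normal((n, d))  # standard Gaussian inputs
    return X, relu(X @ w_star)

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)

for width in (100, 1_000, 4_000):
    # "Random features": a frozen random first layer of ReLU units.
    W = rng.standard_normal((width, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    Phi_tr = relu(X_tr @ W.T)
    Phi_te = relu(X_te @ W.T)

    # Train only the output layer; ridge regression stands in for
    # gradient descent restricted to the output weights.
    lam = 1e-3                       # illustrative regularization strength
    A = Phi_tr.T @ Phi_tr + lam * np.eye(width)
    u = np.linalg.solve(A, Phi_tr.T @ y_tr)

    test_mse = np.mean((Phi_te @ u - y_te) ** 2)
    print(f"width={width:5d}  test MSE={test_mse:.4f}")
```

Because only the output layer is trained, the predictor is confined to the span of the fixed random features; the paper's lower bound says that for this class to approximate even a single ReLU neuron well, the width (or the norm of the output weights) must grow exponentially with the input dimension.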