A Corrective View of Neural Networks: Representation, Memorization and Learning

We develop a corrective mechanism for neural network approximation: the total available non-linear units are divided into multiple groups; the first group approximates the function under consideration, the second group approximates the error produced by the first group and corrects it, the third group approximates the error produced by the first and second groups together, and so on. This technique yields several new representation and learning results for neural networks. First, we show that two-layer neural networks in the random features (RF) regime can memorize arbitrary labels for $n$ arbitrary points under a Euclidean distance separation condition using $\tilde{O}(n)$ ReLUs, which is optimal in $n$ up to logarithmic factors. Next, we give a powerful representation result for two-layer neural networks with ReLUs and smoothed ReLUs, which can achieve squared error at most $\epsilon$ using $O(C(a,d)\epsilon^{-1/(a+1)})$ neurons for $a \in \mathbb{N}\cup\{0\}$, provided the function is smooth enough (roughly, that it has $\Theta(ad)$ bounded derivatives). In certain cases $d$ can be replaced with the effective dimension $q \ll d$. Previous results of this type implement Taylor series approximation using deep architectures. We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions. Finally, we obtain the first $O(\mathrm{subpoly}(1/\epsilon))$ upper bound on the number of neurons required for a two-layer network to learn low-degree polynomials up to squared error $\epsilon$ via gradient descent. Even though deep networks can express these polynomials with $O(\mathrm{polylog}(1/\epsilon))$ neurons, the best learning bounds on this problem require $\mathrm{poly}(1/\epsilon)$ neurons.
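
The corrective mechanism can be illustrated with a minimal random-features sketch in Python/NumPy. This is a toy under stated assumptions, not the paper's construction: the function names (`corrective_rf_fit`, `corrective_rf_predict`) and parameters such as `group_size` are hypothetical, each group's output weights are simply fit to the current residual by least squares, and the specific random-feature distributions and smoothed ReLUs used in the actual results are omitted.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def corrective_rf_fit(X, y, num_groups=3, group_size=200, seed=0):
    """Fit y with several groups of random ReLU features (random-features
    regime: hidden weights are sampled once and frozen; only each group's
    output layer is trained). Group k is fit to the residual left by
    groups 1..k-1, mirroring the corrective mechanism described above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    residual = y.astype(float).copy()
    groups = []  # one (W, b, c) triple per group
    for _ in range(num_groups):
        # Random hidden weights and biases, fixed after sampling.
        W = rng.standard_normal((d, group_size)) / np.sqrt(d)
        b = rng.standard_normal(group_size)
        H = relu(X @ W + b)                                # n x group_size feature matrix
        c, *_ = np.linalg.lstsq(H, residual, rcond=None)   # output weights for this group
        residual = residual - H @ c                        # error left for the next group
        groups.append((W, b, c))
    return groups

def corrective_rf_predict(groups, X):
    # The sum of all groups' outputs approximates the target.
    return sum(relu(X @ W + b) @ c for W, b, c in groups)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 5))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2   # a smooth target function
    groups = corrective_rf_fit(X, y)
    print(np.mean((corrective_rf_predict(groups, X) - y) ** 2))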
