Optimal Approximation Rate of ReLU Networks in terms of Width and Depth

Abstract. This paper concentrates on the approximation power of deep feed-forward neural networks in terms of width and depth. It is proved by construction that ReLU networks with width $\mathcal{O}\big(\max\{d\lfloor N^{1/d}\rfloor,\, N+2\}\big)$ and depth $\mathcal{O}(L)$ can approximate a Hölder continuous function on $[0,1]^d$ with an approximation rate $\mathcal{O}\big(\lambda\sqrt{d}\,(N^2L^2\ln N)^{-\alpha/d}\big)$, where $\alpha\in(0,1]$ and $\lambda>0$ are the Hölder order and constant, respectively. Such a rate is optimal up to a constant in terms of width and depth separately, whereas existing results are only nearly optimal, missing the logarithmic factor in the approximation rate. More generally, for an arbitrary continuous function $f$ on $[0,1]^d$, the approximation rate becomes $\mathcal{O}\big(\sqrt{d}\,\omega_f\big((N^2L^2\ln N)^{-1/d}\big)\big)$, where $\omega_f(\cdot)$ is the modulus of continuity of $f$. We also extend our analysis to any continuous function $f$ on a bounded set. In particular, if ReLU networks with depth $31$ and width $\mathcal{O}(N)$ are used to approximate one-dimensional Lipschitz continuous functions on $[0,1]$ with Lipschitz constant $\lambda>0$, the approximation rate in terms of the total number of parameters, $W=\mathcal{O}(N^2)$, becomes $\mathcal{O}\big(\tfrac{\lambda}{W\ln W}\big)$, a rate not previously established in the literature for fixed-depth ReLU networks.
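For readability, the main bound claimed in the abstract can be collected into a single display. This is a paraphrase in the abstract's own notation, not the paper's exact theorem statement; the constant $C$ below is an assumption standing in for the hidden constant of the $\mathcal{O}(\cdot)$ notation. For every $f$ that is Hölder continuous of order $\alpha\in(0,1]$ with constant $\lambda>0$ on $[0,1]^d$ and every $N,L\in\mathbb{N}^+$, there exists a ReLU network $\phi$ with width $\mathcal{O}\big(\max\{d\lfloor N^{1/d}\rfloor,\, N+2\}\big)$ and depth $\mathcal{O}(L)$ such that
\[
\|f-\phi\|_{L^\infty([0,1]^d)} \;\le\; C\,\lambda\,\sqrt{d}\,\big(N^2L^2\ln N\big)^{-\alpha/d}.
\]
Reading off the two extremes: with depth held fixed ($L=\mathcal{O}(1)$) the error decays like $(N^2\ln N)^{-\alpha/d}$ in the width parameter $N$, and with width held fixed the error decays like $L^{-2\alpha/d}$ in the depth parameter $L$, which is the sense in which the rate is optimal in width and depth separately.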
