Approximation by Combinations of ReLU and Squared ReLU Ridge Functions With $\ell^1$ and $\ell^0$ Controls

We establish $L^{\infty}$ and $L^{2}$ error bounds for functions of many variables that are approximated by linear combinations of rectified linear unit (ReLU) and squared ReLU ridge functions with $\ell^{1}$ and $\ell^{0}$ controls on their inner and outer parameters. With the squared ReLU ridge function, we show that the $L^{2}$ approximation error is inversely proportional to the inner layer $\ell^{0}$ sparsity and need only be sublinear in the outer layer $\ell^{0}$ sparsity. Our constructions are obtained using a variant of the Maurey–Jones–Barron probabilistic method, which can be interpreted as either stratified sampling with proportionate allocation or two-stage cluster sampling. We also provide companion error lower bounds that reveal near optimality of our constructions. Despite the sparsity assumptions, we showcase the richness and flexibility of these ridge combinations by defining a large family of functions, in terms of certain spectral conditions, that are particularly well approximated by them.
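
For context, a minimal statement of the classical Maurey–Jones–Barron sampling bound underlying the method mentioned above (the paper develops a refined, stratified variant; the norms and constant here are the textbook ones, not the paper's sharpened rates): if $f$ lies in the closure of the convex hull of a set $G$ in a Hilbert space with $\sup_{g \in G} \|g\| \le b$, then for every $m \ge 1$ there exist $g_1, \dots, g_m \in G$ such that
\[
  \Bigl\| f - \frac{1}{m} \sum_{k=1}^{m} g_k \Bigr\|^2 \;\le\; \frac{b^2 - \|f\|^2}{m}.
\]
The bound follows by drawing $g_1, \dots, g_m$ i.i.d. from a probability measure on $G$ whose mean is $f$ and noting that the expected squared error of the empirical average equals the variance divided by $m$; stratifying this sampling scheme is what yields the improved dependence on the inner and outer $\ell^{0}$ sparsity levels.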
