How Powerful are Shallow Neural Networks with Bandlimited Random Weights?

We investigate the expressive power of depth-2 bandlimited random neural networks. A random net is a neural network whose hidden-layer parameters are frozen at randomly assigned values, and only the output-layer parameters are trained by loss minimization. Using random weights for the hidden layer is an effective way to avoid the non-convex optimization of standard gradient-descent learning, and it has also been adopted in recent deep learning theories. Despite the well-known fact that a neural network is a universal approximator, we show mathematically that when the hidden parameters are distributed in a bounded domain, the network may not achieve zero approximation error. In particular, we derive a new, nontrivial approximation error lower bound. The proof uses ridgelet analysis, a harmonic-analysis method designed for neural networks, and is inspired by a fundamental principle of classical signal processing: a bandlimited system cannot, in general, perfectly reconstruct an arbitrary signal. We corroborate our theoretical results with various simulation studies and offer two main take-home messages: (i) not every distribution for drawing random weights yields a universal approximator; (ii) a suitable assignment of random weights exists, but its choice depends to some degree on the complexity of the target function.
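To make the setting concrete, the following is a minimal sketch (not the paper's code) of a depth-2 random-feature network whose hidden weights are drawn from a bounded domain and frozen, with only the output layer fit by ridge-regularized least squares. The cosine activation, the uniform weight distribution, and the names `bandwidth` and `target` are illustrative assumptions, not choices taken from the paper.

```python
# Minimal sketch of a bandlimited random net: frozen hidden weights drawn from
# a bounded domain, trained output layer only. Illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

def random_feature_map(x, W, b):
    """Hidden layer with frozen random weights; cosine activation as an example."""
    return np.cos(x @ W + b)

def target(x, freq=3.0):
    """Toy target on [-1, 1]; its frequency stands in for 'complexity'."""
    return np.sin(freq * np.pi * x)

n_train, n_hidden, bandwidth = 200, 500, 5.0  # 'bandwidth' bounds the hidden weights
x_train = rng.uniform(-1.0, 1.0, size=(n_train, 1))
y_train = target(x_train[:, 0])

# Hidden parameters sampled once from a bounded domain and never updated.
W = rng.uniform(-bandwidth, bandwidth, size=(1, n_hidden))
b = rng.uniform(0.0, 2.0 * np.pi, size=n_hidden)

# Only the output weights are trained (ridge-regularized least squares).
Phi = random_feature_map(x_train, W, b)
beta = np.linalg.solve(Phi.T @ Phi + 1e-6 * np.eye(n_hidden), Phi.T @ y_train)

x_test = np.linspace(-1.0, 1.0, 1000)[:, None]
y_pred = random_feature_map(x_test, W, b) @ beta
rmse = np.sqrt(np.mean((y_pred - target(x_test[:, 0])) ** 2))
print(f"test RMSE with weight bound {bandwidth}: {rmse:.4f}")
```

In this sketch, raising the frequency of `target` well beyond the weight bound `bandwidth` makes the residual error plateau above zero, which mirrors (under these toy assumptions) the two take-home messages: the bounded weight distribution limits what can be approximated, and a workable choice of that distribution depends on the target's complexity.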
