Ghosts in Neural Networks: Existence, Structure and Role of Infinite-Dimensional Null Space

Overparametrization has been remarkably successful in deep learning. This study investigates an overlooked but important aspect of overparametrized neural networks: the null components in their parameters, which we call ghosts. Since deep learning is not explicitly regularized, typical deep learning solutions contain null components. In this paper, we present a structure theorem for the null space of a general class of neural networks. Specifically, we show that any null element can be written uniquely as a linear combination of ridgelet transforms. In general, it is quite difficult to fully characterize the null space of an arbitrary operator, so the structure theorem is a great advantage for understanding the complicated landscape of neural network parameters. As applications, we discuss the role of ghosts in the generalization performance of deep learning.
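
To make the notion of a null component concrete, the following is a minimal numerical sketch (not the paper's construction): in a two-layer network f(x) = Σ_j c_j σ(a_j · x + b_j), appending two hidden units that share the same (a, b) but carry opposite output weights adds a parameter direction that realizes the zero function, i.e. a ghost. All function and variable names below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_layer(x, A, b, c):
    """Two-layer ReLU network: f(x) = sum_j c_j * relu(a_j . x + b_j)."""
    return np.maximum(A @ x.T + b[:, None], 0.0).T @ c

# A small overparametrized network with random parameters.
d, m = 3, 50
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)
c = rng.standard_normal(m)

# Build a "ghost": duplicate one hidden unit and give the two copies
# opposite output weights. The added parameters realize the zero
# function, so the network's output is unchanged.
A_g = np.vstack([A, A[:1], A[:1]])
b_g = np.concatenate([b, b[:1], b[:1]])
c_g = np.concatenate([c, [+1.0, -1.0]])

x = rng.standard_normal((100, d))
print(np.allclose(two_layer(x, A, b, c), two_layer(x, A_g, b_g, c_g)))  # True
```

In the infinite-width (integral representation) setting studied in the paper, such cancelling directions form an infinite-dimensional subspace, and the structure theorem describes them in terms of ridgelet transforms.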
