On the Spectral Bias of Deep Neural Networks

It is well known that over-parametrized deep neural networks (DNNs) are an overly expressive class of functions that can memorize even random data with $100\%$ training accuracy. This raises the question of why they do not easily overfit real data. To answer this question, we study deep networks using Fourier analysis. We show that deep networks with finite weights (or trained for a finite number of steps) are inherently biased towards representing smooth functions over the input space. Specifically, the magnitude of a particular frequency component $k$ of a deep ReLU network function decays at least as fast as $\mathcal{O}(k^{-2})$, with width and depth helping polynomially and exponentially (respectively) in modeling higher frequencies. This shows, for instance, why DNNs cannot perfectly \textit{memorize} peaky delta-like functions. We also show that DNNs can exploit the geometry of low-dimensional data manifolds to approximate complex functions that exist along the manifold with simple functions when viewed with respect to the input space. As a consequence, we find that all samples (including adversarial samples) classified by a network as belonging to a certain class are connected by a path along which the network's prediction does not change. Finally, we find that DNN parameters corresponding to functions with higher frequency components occupy a smaller volume in parameter space.
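The spectral-bias claim can be probed empirically: fit a small ReLU network to a 1-D superposition of sinusoids and inspect the Fourier spectrum of the learned function. The sketch below is a minimal illustration, not the authors' experimental setup; the architecture, frequencies, and training schedule are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's setup): train a small deep
# ReLU MLP on a 1-D mixture of sinusoids and measure how strongly each
# target frequency is represented in the learned function.
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)

# Target: equal-amplitude sinusoids at increasing frequencies (cycles over [0, 1)).
freqs = [1, 5, 10, 20]
x = torch.arange(1024).unsqueeze(1) / 1024.0          # N x 1 inputs on [0, 1)
y = sum(torch.sin(2 * np.pi * k * x) for k in freqs)  # N x 1 targets

# Small deep ReLU network (width/depth chosen arbitrarily for the sketch).
net = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    opt.zero_grad()
    loss = ((net(x) - y) ** 2).mean()
    loss.backward()
    opt.step()

# Magnitude of each target frequency in the learned function.
with torch.no_grad():
    spectrum = np.abs(np.fft.rfft(net(x).squeeze().numpy())) / len(x)
for k in freqs:
    print(f"frequency {k:3d}: |F(k)| = {spectrum[k]:.4f}")
# Lower frequencies are typically recovered earlier and with larger magnitude,
# consistent with the decay of high-frequency components discussed above.
```

Stopping training early (or shrinking the network) makes the effect more pronounced: the low-frequency components of the target are fit first, while the higher frequencies remain attenuated.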
