The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies

We study the relationship between the frequency of a function and the speed at which a neural network learns it. We build on recent results showing that the dynamics of overparameterized neural networks trained with gradient descent can be well approximated by a linear system. When normalized training data is uniformly distributed on a hypersphere, the eigenfunctions of this linear system are spherical harmonic functions. We derive the corresponding eigenvalues for each frequency after introducing a bias term in the model; this bias term had been omitted from earlier linearized network models without significantly affecting their theoretical results. However, we show theoretically and experimentally that a shallow neural network without bias cannot represent or learn simple, low-frequency functions of odd frequency. Our results lead to specific predictions of the time it takes a network to learn functions of varying frequency. These predictions match the empirical behavior of both shallow and deep networks.
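The frequency-dependent convergence described above can be illustrated numerically. The sketch below is not the authors' code; the width, learning rate, step count, and choice of the unit circle as the data domain are illustrative assumptions. It trains a shallow ReLU network with a hidden-layer bias by full-batch gradient descent to fit a superposition of harmonics on the circle, then measures how much of each frequency component has been recovered after a fixed number of steps; lower frequencies should be fit substantially earlier than higher ones.

# Minimal sketch (assumed hyperparameters): spectral bias of a shallow
# ReLU network with bias, trained on points of the unit circle.
import numpy as np

rng = np.random.default_rng(0)

# Training data: n equispaced points on S^1, embedded in R^2.
n, width, lr, steps = 256, 2048, 0.5, 2000
theta = np.linspace(0, 2 * np.pi, n, endpoint=False)
X = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Target: superposition of pure harmonics of several frequencies.
freqs = [1, 2, 4, 8]
y = sum(np.cos(k * theta) for k in freqs)

# Two-layer ReLU network with a bias term in the hidden layer.
W = rng.normal(size=(width, 2))
b = np.zeros(width)
a = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)

def forward(W, b):
    pre = X @ W.T + b                 # (n, width) pre-activations
    return np.maximum(pre, 0.0), pre  # ReLU output and pre-activations

# Full-batch gradient descent on the hidden-layer weights and biases only,
# mirroring the linearized (kernel-regime) analysis.
for _ in range(steps):
    h, pre = forward(W, b)
    err = h @ a - y                       # residual on the training set
    grad_h = np.outer(err, a) * (pre > 0) # backprop through ReLU
    W -= lr * (grad_h.T @ X) / n
    b -= lr * grad_h.mean(axis=0)

# Projection of the learned function onto each target harmonic; values
# near 1.0 mean that frequency has been fit.
pred = forward(W, b)[0] @ a
for k in freqs:
    coeff = 2 * np.mean(pred * np.cos(k * theta))
    print(f"frequency k={k}: learned coefficient ~ {coeff:.3f} (target 1.0)")

With the illustrative settings above, the low-frequency coefficients approach 1.0 after far fewer steps than the high-frequency ones, consistent with the predicted eigenvalue decay; training the output weights as well should not change this qualitative ordering.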
