Towards Understanding the Condensation of Two-layer Neural Networks at Initial Training

Studying the implicit regularization effect of the nonlinear training dynamics of neural networks (NNs) is important for understanding why over-parameterized NNs often generalize well on real datasets. Empirically, for two-layer NNs, existing works have shown that, with small initialization, the input weights of hidden neurons (the input weight of a hidden neuron consists of the weights from the input layer to that neuron together with its bias term) condense onto isolated orientations. This condensation dynamics implies that, during training, NNs can learn features from the training data with a configuration effectively equivalent to that of a much smaller network. In this work, we show that the multiplicity of the zero of the activation function at the origin (referred to as “multiplicity”) is a key factor for understanding condensation at the initial stage of training. Our experiments on multilayer networks suggest that the maximal number of condensed orientations is twice the multiplicity of the activation function used. Our theoretical analysis of two-layer networks confirms the experiments in two cases: one for activation functions of multiplicity one, which include many common activation functions, and the other for one-dimensional input. This work is a step towards understanding how small initialization implicitly leads NNs to condensation at the initial training stage, and it lays a foundation for future study of the nonlinear dynamics of NNs and their implicit regularization effect at later stages of training.
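
To make the abstract's terminology concrete, the following is a minimal sketch of one natural reading of “multiplicity”, namely the order of the zero of the activation function $\sigma$ at the origin; this reading is an assumption inferred from the abstract's phrasing, not a definition quoted from the paper body:

\[
  \sigma^{(k)}(0) = 0 \ \text{for } k = 0, 1, \dots, p-1
  \quad \text{and} \quad \sigma^{(p)}(0) \neq 0
  \;\Longrightarrow\; \text{$\sigma$ has multiplicity } p .
\]

For example, $\tanh(0) = 0$ and $\tanh'(0) = 1 \neq 0$, so $\tanh$ has multiplicity $p = 1$; under the empirical observation stated above, the input weights would then condense onto at most $2p = 2$ isolated orientations at the initial stage of training.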
