Embedding Principle of Loss Landscape of Deep Neural Networks

Understanding the structure of the loss landscape of deep neural networks (DNNs) is of clear importance. In this work, we prove an embedding principle: the loss landscape of a DNN "contains" all the critical points of all narrower DNNs. More precisely, we construct a critical embedding such that any critical point, e.g., a local or global minimum, of a narrower DNN can be embedded into a critical point/affine subspace of the target DNN with higher degeneracy while preserving the DNN output function. This embedding structure of critical points holds for any training data, any differentiable loss function, and any differentiable activation function. Such a general structure of DNNs is starkly different from that of other nonconvex problems such as protein folding. Empirically, we find that a wide DNN is often attracted to highly degenerate critical points that are embedded from narrow DNNs. The embedding principle provides a new perspective on why wide DNNs are generally easy to optimize and reveals a potential implicit low-complexity regularization during training. Overall, our work provides a skeleton for the study of the loss landscape of DNNs and its implications, from which a more exact and comprehensive understanding can be anticipated.
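To make the embedding concrete, the sketch below illustrates one simple instance of such a critical embedding for a two-layer network: a hidden neuron is duplicated, its input weights are copied, and its output weight is split between the two copies. This is a minimal numerical sketch, not the authors' code; the tanh network, the MSE loss, and helper names such as embed_split are illustrative assumptions. It checks that the output function is preserved exactly and that the wide-network gradient at the embedded point is a linear image of the narrow-network gradient, so a critical point of the narrow network maps to a critical point of the wider one.

# Minimal sketch (assumed setup, not the authors' code): embed a two-layer
# tanh network of width m into width m+1 by duplicating a hidden neuron and
# splitting its output weight as alpha / (1 - alpha).
import torch

torch.manual_seed(0)
d, m, n = 3, 4, 16              # input dim, narrow width, batch size
X = torch.randn(n, d)
y = torch.randn(n, 1)

def forward(W, a, X):
    """Two-layer network: f(x) = a^T tanh(W x)."""
    return torch.tanh(X @ W.T) @ a          # shape (n, 1)

def loss_and_grads(W, a, X, y):
    """MSE loss and its gradients w.r.t. the parameters (W, a)."""
    W = W.clone().requires_grad_(True)
    a = a.clone().requires_grad_(True)
    loss = 0.5 * ((forward(W, a, X) - y) ** 2).mean()
    gW, ga = torch.autograd.grad(loss, (W, a))
    return loss, gW, ga

# Narrow-network parameters (an arbitrary point in parameter space).
W = torch.randn(m, d)
a = torch.randn(m, 1)

def embed_split(W, a, k, alpha):
    """Duplicate neuron k; split its output weight as alpha / (1 - alpha)."""
    W_wide = torch.cat([W, W[k:k+1]], dim=0)               # copy input weights
    a_wide = torch.cat([a, (1 - alpha) * a[k:k+1]], dim=0)  # new copy's output weight
    a_wide[k] = alpha * a[k]                                 # original copy's output weight
    return W_wide, a_wide

k, alpha = 1, 0.3
W_wide, a_wide = embed_split(W, a, k, alpha)

# (i) The output function is preserved exactly by the embedding.
assert torch.allclose(forward(W, a, X), forward(W_wide, a_wide, X), atol=1e-6)

# (ii) The wide gradient is a linear image of the narrow gradient: the two
# copies' input-weight gradients are alpha and (1 - alpha) times the original,
# and their output-weight gradients coincide with the original. Hence a zero
# gradient (critical point) of the narrow net embeds to a zero gradient.
_, gW, ga = loss_and_grads(W, a, X, y)
_, gW_wide, ga_wide = loss_and_grads(W_wide, a_wide, X, y)
assert torch.allclose(gW_wide[k],  alpha * gW[k], atol=1e-6)
assert torch.allclose(gW_wide[-1], (1 - alpha) * gW[k], atol=1e-6)
assert torch.allclose(ga_wide[k],  ga[k], atol=1e-6)
assert torch.allclose(ga_wide[-1], ga[k], atol=1e-6)
print("output preserved; gradients map linearly -> critical points embed")

Because varying the splitting ratio alpha changes neither the output function nor the loss, the embedded critical point in this sketch lies on an affine line of critical points in the wider parameter space, consistent with the higher degeneracy described above.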
