Logarithmic landscape and power-law escape rate of SGD

Stochastic gradient descent (SGD) is subject to complicated multiplicative noise when the loss is the mean-square error. We exploit this property of the SGD noise to derive a stochastic differential equation (SDE) with simpler, additive noise by performing a non-uniform transformation of the time variable. In the resulting SDE, the gradient of the loss is replaced by the gradient of the logarithmized loss. Consequently, we show that, near a local or global minimum, the stationary distribution P_ss(θ) of the network parameters θ follows a power law in the loss function L(θ), i.e., P_ss(θ) ∝ L(θ)^{−φ}, with the exponent φ determined by the mini-batch size, the learning rate, and the Hessian at the minimum. We also obtain a formula for the escape rate from a local minimum, which is governed not by the loss-barrier height ΔL = L(θ_s) − L(θ*) between a minimum θ* and a saddle θ_s but by the logarithmized loss-barrier height Δ log L = log[L(θ_s)/L(θ*)]. Our escape-rate formula explains the empirical fact that SGD prefers flat minima with low effective dimensions.
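As a rough numerical illustration of the power-law stationary distribution P_ss(θ) ∝ L(θ)^{−φ}, the Python (NumPy) sketch below simulates a 1-D toy model that is not taken from the paper: a quadratic loss with a small residual value L0 at the minimum, and SGD gradient noise whose standard deviation scales as sqrt(2hL(θ)/B), a common near-minimum approximation for mean-square losses. The loss function, the parameter values, and the quoted continuum-limit exponent 1 + B/(ηh) are assumptions of this toy, not the paper's formula for φ; fitting log p(θ) against log L(θ) should recover an exponent close to the toy's prediction, up to discretization and finite-sample error.

import numpy as np

rng = np.random.default_rng(0)

h   = 1.0     # curvature (Hessian) of the toy loss at the minimum
L0  = 1e-3    # residual loss at the minimum, keeps L(theta) > 0
eta = 0.5     # learning rate
B   = 1       # mini-batch size (enters the noise scale through eta/B)

def loss(theta):
    return L0 + 0.5 * h * theta ** 2

# Run many independent walkers in parallel so the Python loop stays short.
n_walkers, n_steps, burn_in = 5000, 4000, 2000
theta = np.zeros(n_walkers)
samples = []
for t in range(n_steps):
    # Toy near-minimum model of mean-square-loss SGD noise:
    # gradient-noise std ~ sqrt(2 h L(theta) / B), i.e. multiplicative in theta.
    noise = np.sqrt(2.0 * h * loss(theta) / B) * rng.standard_normal(n_walkers)
    theta = theta - eta * (h * theta + noise)   # one SGD step
    if t >= burn_in and t % 10 == 0:
        samples.append(theta.copy())
samples = np.concatenate(samples)

# Empirical stationary density of theta, fitted as log p(theta) ≈ -phi * log L(theta) + c.
counts, edges = np.histogram(samples, bins=200)
widths = np.diff(edges)
centers = 0.5 * (edges[1:] + edges[:-1])
dens = counts / (counts.sum() * widths)
mask = counts >= 10                      # keep only well-sampled bins
slope, _ = np.polyfit(np.log(loss(centers[mask])), np.log(dens[mask]), 1)

print(f"fitted exponent              phi ≈ {-slope:.2f}")
# Continuum (Ito SDE) limit of this particular toy; discretization shifts the fit somewhat.
print(f"toy continuum-limit estimate phi ≈ {1.0 + B / (eta * h):.2f}")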
