The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects
Zhanxing Zhu | Jingfeng Wu | Bing Yu | Lei Wu | Jinwen Ma
[1] L. Bottou. Stochastic Gradient Learning in Neural Networks, 1991.
[2] Michael I. Jordan, et al. How to Escape Saddle Points Efficiently, 2017, ICML.
[3] Shai Shalev-Shwartz, et al. SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data, 2017, ICLR.
[4] Yann Dauphin, et al. Empirical Analysis of the Hessian of Over-Parametrized Neural Networks, 2017, ICLR.
[5] Yuchen Zhang, et al. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics, 2017, COLT.
[6] Kai Zheng, et al. Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints, 2017, COLT.
[7] Zhanxing Zhu, et al. Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling, 2015, NIPS.
[8] Nathan Srebro, et al. Exploring Generalization in Deep Learning, 2017, NIPS.
[9] Zhanxing Zhu, et al. Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes, 2017, arXiv.
[10] Naftali Tishby, et al. Opening the Black Box of Deep Neural Networks via Information, 2017, arXiv.
[11] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.
[12] Yuanzhi Li, et al. Convergence Analysis of Two-layer Neural Networks with ReLU Activation, 2017, NIPS.
[13] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.
[14] Quoc V. Le, et al. A Bayesian Perspective on Generalization and Stochastic Gradient Descent, 2017, ICLR.
[15] Jürgen Schmidhuber, et al. Flat Minima, 1997, Neural Computation.
[16] Stefano Soatto, et al. Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks, 2017, 2018 Information Theory and Applications Workshop (ITA).
[17] Yoshua Bengio, et al. Three Factors Influencing Minima in SGD, 2017, arXiv.
[18] E Weinan, et al. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms, 2015, ICML.
[19] Thomas Hofmann, et al. Escaping Saddles with Stochastic Gradients, 2018, ICML.
[20] Elad Hoffer, et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks, 2017, NIPS.
[21] Sungjin Ahn, et al. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring, 2012, ICML.
[22] Wenqing Hu, et al. On the diffusion approximation of nonconvex stochastic gradient descent, 2017, Annals of Mathematical Sciences and Applications.
[23] David M. Blei, et al. Stochastic Gradient Descent as Approximate Bayesian Inference, 2017, J. Mach. Learn. Res.
[24] B. Øksendal. Stochastic Differential Equations, 1985.
[25] Y. Pawitan. In All Likelihood: Statistical Modelling and Inference Using Likelihood, 2002.
[26] Tianqi Chen, et al. Stochastic Gradient Hamiltonian Monte Carlo, 2014, ICML.