Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent
[1] Yann LeCun, et al. Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks, 2018, ArXiv.
[2] Masashi Sugiyama, et al. A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima, 2020, ICLR.
[3] Yoshua Bengio, et al. Three Factors Influencing Minima in SGD, 2017, ArXiv.
[4] W. Ebeling. Stochastic Processes in Physics and Chemistry, 1995.
[5] P. Hänggi, et al. Reaction-rate theory: fifty years after Kramers, 1990.
[6] Geoffrey E. Hinton, et al. On the importance of initialization and momentum in deep learning, 2013, ICML.
[7] F. Bach, et al. Bridging the gap between constant step size stochastic gradient descent and Markov chains, 2017, The Annals of Statistics.
[8] David A. McAllester, et al. A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks, 2017, ICLR.
[9] Yee Whye Teh, et al. Bayesian Learning via Stochastic Gradient Langevin Dynamics, 2011, ICML.
[10] Lei Wu. How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective, 2018.
[11] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.
[12] Levent Sagun, et al. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks, 2019, ICML.
[13] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[14] Zhang Zhiyi, et al. On the Distributional Properties of Adaptive Gradients, 2021, UAI.
[15] Razvan Pascanu, et al. Revisiting Natural Gradient for Deep Networks, 2013, ICLR.
[16] Stefano Soatto, et al. Entropy-SGD: biasing gradient descent into wide valleys, 2016, ICLR.
[17] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.
[18] J. Zico Kolter, et al. Generalization in Deep Networks: The Role of Distance from Initialization, 2019, ArXiv.
[19] Tomaso A. Poggio, et al. Theory of Deep Learning IIb: Optimization Properties of SGD, 2018, ArXiv.
[20] Jürgen Schmidhuber, et al. Flat Minima, 1997, Neural Computation.
[21] Murat A. Erdogdu, et al. Hausdorff dimension, heavy tails, and generalization in neural networks, 2020, NeurIPS.
[22] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[23] Lin Xiao, et al. Understanding the Role of Momentum in Stochastic Gradient Methods, 2019, NeurIPS.
[24] Liu Ziyin, et al. On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes, 2021, ArXiv.
[25] Yurii Nesterov, et al. Lectures on Convex Optimization, 2018.
[26] Wenqing Hu, et al. On the diffusion approximation of nonconvex stochastic gradient descent, 2017, Annals of Mathematical Sciences and Applications.
[27] Léon Bottou. Online Learning and Stochastic Approximations, 1998.
[28] Zhanxing Zhu, et al. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects, 2018, ICML.
[29] Quoc V. Le, et al. A Bayesian Perspective on Generalization and Stochastic Gradient Descent, 2017, ICLR.
[30] David J. C. MacKay, et al. A Practical Bayesian Framework for Backpropagation Networks, 1992, Neural Computation.
[31] Nicholas J. Higham, et al. Solving a Quadratic Matrix Equation by Newton's Method with Exact Line Searches, 2001, SIAM J. Matrix Anal. Appl.
[32] L. Landau, et al. Statistical Physics, Part 1, 1958.
[33] C. R. Rao. Information and the Accuracy Attainable in the Estimation of Statistical Parameters, 1992.
[34] Zhi-Ming Ma, et al. Dynamic of Stochastic Gradient Descent with State-Dependent Noise, 2020, ArXiv.
[35] Tianqi Chen, et al. A Complete Recipe for Stochastic Gradient MCMC, 2015, NIPS.
[36] Shun-ichi Amari, et al. Methods of Information Geometry, 2000.
[37] Shun-ichi Amari. Understand It in 5 Minutes!? A Skim of a Famous Paper: Jacot, Arthur, Gabriel, Franck and Hongler, Clément: Neural Tangent Kernel: Convergence and Generalization in Neural Networks, 2020.
[38] Jaehoon Lee, et al. Wide neural networks of any depth evolve as linear models under gradient descent, 2019, NeurIPS.
[39] Colin Wei, et al. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks, 2019, NeurIPS.
[40] Elad Hoffer, et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks, 2017, NIPS.
[41] Gintare Karolina Dziugaite, et al. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data, 2017, UAI.
[42] H. Kramers. Brownian motion in a field of force and the diffusion model of chemical reactions, 1940.
[43] Klaus-Robert Müller, et al. Efficient BackProp, 2012, Neural Networks: Tricks of the Trade.
[44] D. Sherrington. Stochastic Processes in Physics and Chemistry, 1983.
[45] Takashi Mori, et al. Improved generalization by noise enhancement, 2020, ArXiv.
[46] J. Rissanen. A Universal Prior for Integers and Estimation by Minimum Description Length, 1983.
[47] Jascha Sohl-Dickstein, et al. The large learning rate phase of deep learning: the catapult mechanism, 2020, ArXiv.
[48] A. Einstein. Über die von der molekularkinetischen Theorie der Wärme geforderte Bewegung von in ruhenden Flüssigkeiten suspendierten Teilchen [AdP 17, 549 (1905)], 2005, Annalen der Physik.
[49] Hossein Mobahi, et al. Fantastic Generalization Measures and Where to Find Them, 2019, ICLR.
[50] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[51] Stefano Soatto, et al. Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks, 2017, 2018 Information Theory and Applications Workshop (ITA).
[52] David M. Blei, et al. Stochastic Gradient Descent as Approximate Bayesian Inference, 2017, J. Mach. Learn. Res.
[53] Michael W. Mahoney, et al. Multiplicative noise and heavy tails in stochastic optimization, 2020, ICML.
[54] Liu Ziyin, et al. Logarithmic landscape and power-law escape rate of SGD, 2021, ArXiv.
[55] E Weinan, et al. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms, 2015, ICML.
[56] Tomaso A. Poggio, et al. Fisher-Rao Metric, Geometry, and Complexity of Neural Networks, 2017, AISTATS.
[57] Francis R. Bach, et al. From Averaging to Acceleration, There is Only a Step-size, 2015, COLT.
[58] Jerry Ma, et al. Quasi-hyperbolic momentum and Adam for deep learning, 2018, ICLR.