Law of Balance and Stationary Distribution of Stochastic Gradient Descent