On a continuous time model of gradient descent dynamics and instability in deep learning