ON EMPIRICAL COMPARISONS OF OPTIMIZERS
Chris J. Maddison, George E. Dahl, Christopher J. Shallue, Dami Choi, Jaehoon Lee, Zachary Nado
[1] Sashank J. Reddi, et al. On the Convergence of Adam and Beyond, 2018, ICLR.
[2] Xu Sun, et al. Adaptive Gradient Methods with Dynamic Bound of Learning Rate, 2019, ICLR.
[3] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[4] Tom Schaul, et al. Rainbow: Combining Improvements in Deep Reinforcement Learning, 2017, AAAI.
[5] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[6] Thomas Brox, et al. Striving for Simplicity: The All Convolutional Net, 2014, ICLR.
[7] Yoshua Bengio, et al. Random Search for Hyper-Parameter Optimization, 2012, J. Mach. Learn. Res.
[8] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.
[9] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[10] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Li Shen, et al. A Sufficient Condition for Convergences of Adam and RMSProp, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Jasper Snoek, et al. Practical Bayesian Optimization of Machine Learning Algorithms, 2012, NIPS.
[14] H. Robbins. A Stochastic Approximation Method, 1951.
[15] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[16] Frank Hutter, et al. Fixing Weight Decay Regularization in Adam, 2017, ArXiv.
[17] Boris Polyak. Some methods of speeding up the convergence of iteration methods, 1964.
[18] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.
[19] Quoc V. Le, et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019, ICML.
[20] Laurence Aitchison, et al. A unified theory of adaptive stochastic gradient descent as Bayesian filtering, 2018, ArXiv.
[21] Liyuan Liu, et al. On the Variance of the Adaptive Learning Rate and Beyond, 2019, ICLR.
[22] Frank Schneider, et al. DeepOBS: A Deep Learning Optimizer Benchmark Suite, 2019, ICLR.
[23] Yann LeCun, et al. Improving the convergence of back-propagation learning with second-order methods, 1989.
[24] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.
[25] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[26] Aurélien Lucchi, et al. Ellipsoidal Trust Region Methods and the Marginal Value of Hessian Information for Neural Network Training, 2019, ArXiv.
[27] Bo Chen, et al. MnasNet: Platform-Aware Neural Architecture Search for Mobile, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Ioannis Mitliagkas, et al. YellowFin and the Art of Momentum Tuning, 2017, MLSys.
[29] Jian Sun, et al. Identity Mappings in Deep Residual Networks, 2016, ECCV.
[30] Li Shen, et al. On the Convergence of Weighted AdaGrad with Momentum for Training Deep Neural Networks, 2018.
[31] Elad Hoffer, et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks, 2017, NIPS.
[32] Jerry Ma, et al. Quasi-hyperbolic momentum and Adam for deep learning, 2018, ICLR.
[33] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.
[34] Richard Socher, et al. Improving Generalization Performance by Switching from Adam to SGD, 2017, ArXiv.
[35] Sanjiv Kumar, et al. Adaptive Methods for Nonconvex Optimization, 2018, NeurIPS.
[36] Kamyar Azizzadenesheli, et al. signSGD: compressed optimisation for non-convex problems, 2018, ICML.
[37] Leslie N. Smith, et al. A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay, 2018, ArXiv.
[38] Yong Yu, et al. AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods, 2018, ICLR.
[39] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[40] Geoffrey E. Hinton, et al. On the importance of initialization and momentum in deep learning, 2013, ICML.
[41] Jascha Sohl-Dickstein, et al. Measuring the Effects of Data Parallelism on Neural Network Training, 2018, J. Mach. Learn. Res.
[42] Guangwen Yang, et al. NAMSG: An Efficient Method For Training Neural Networks, 2019, ArXiv.
[43] Guodong Zhang, et al. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model, 2019, NeurIPS.
[44] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.
[45] Olivier Teytaud, et al. Critical Hyper-Parameters: No Random, No Cry, 2017, ArXiv.
[46] R. Tyrrell Rockafellar, et al. Convex Analysis, 1970, Princeton Landmarks in Mathematics and Physics.
[47] Timothy Dozat, et al. Incorporating Nesterov Momentum into Adam, 2016.
[48] Philipp Hennig, et al. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients, 2017, ICML.
[49] Mingyi Hong, et al. On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, 2018, ICLR.
[50] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.
[51] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[52] Soham De, et al. Convergence Guarantees for RMSProp and ADAM in Non-Convex Optimization and an Empirical Comparison to Nesterov Acceleration, 2018, ArXiv (1807.06766).
[53] Quoc V. Le, et al. AutoAugment: Learning Augmentation Policies from Data, 2018, ArXiv.
[54] Roland Vollgraf, et al. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017, ArXiv.
[55] Léon Bottou, et al. Large-Scale Machine Learning with Stochastic Gradient Descent, 2010, COMPSTAT.