Lookahead Optimizer: k steps forward, 1 step back
Michael R. Zhang | James Lucas | Geoffrey E. Hinton | Jimmy Ba
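The title summarizes the method: an inner optimizer updates "fast" weights for k steps, after which the "slow" weights take one interpolation step back toward them. Below is a minimal sketch of that update rule, assuming a plain-SGD inner loop on a toy problem; the names grad_fn, inner_lr, and outer_steps are illustrative choices made here, while k and alpha follow the paper's notation.

```python
import numpy as np

def lookahead_sgd(grad_fn, w0, k=5, alpha=0.5, inner_lr=0.1, outer_steps=100):
    """Sketch of the Lookahead update rule: the inner optimizer (plain SGD here)
    takes k fast-weight steps, then the slow weights move a fraction alpha
    toward the fast weights and the fast weights are reset to the slow weights."""
    slow = np.asarray(w0, dtype=float)           # slow weights
    for _ in range(outer_steps):
        fast = slow.copy()                       # fast weights start at the slow weights
        for _ in range(k):                       # k steps forward
            fast -= inner_lr * grad_fn(fast)     # inner SGD update
        slow += alpha * (fast - slow)            # 1 step back: interpolate toward fast weights
    return slow

# Usage on a toy quadratic f(w) = 0.5 * ||w||^2, whose gradient is w.
w_final = lookahead_sgd(lambda w: w, w0=np.array([3.0, -2.0]))
print(w_final)  # approaches the minimum at the origin
```

In the paper, the inner optimizer can be any standard method (e.g., SGD with momentum or Adam); plain SGD is used here only to keep the sketch self-contained.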