Gadam: Combining Adaptivity with Iterate Averaging Gives Greater Generalisation