Gadam: Combining Adaptivity with Iterate Averaging Gives Greater Generalisation

We introduce Gadam, which combines Adam with iterate averaging (IA) to significantly improve generalisation performance without sacrificing adaptivity. Using high-dimensional concentration theorems, we argue that the noise-reducing properties of IA are particularly appealing for large deep neural networks trained with small batch sizes, and we contrast IA with popular alternatives such as the exponential moving average (EMA), batch-size increases and learning-rate decay. Furthermore, since adaptive methods enjoy improved pre-asymptotic convergence under mild conditions, we expect this combination to be more effective than SGD with IA in finite time. We show that combining decoupled weight decay with IA allows a high effective learning rate in networks with batch normalisation, which exerts additional regularisation. On language modelling (PTB), Gadam outperforms finely tuned SGD, SGD with IA and Adam by a significant margin. On image classification (CIFAR-10/100, ImageNet-32), Gadam is consistently superior to finely tuned SGD, and its partially adaptive variant GadamX outperforms SGD with IA.
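
The core recipe can be sketched with standard PyTorch utilities: train with Adam using decoupled weight decay (AdamW) and maintain an average of the late iterates, recomputing batch-normalisation statistics for the averaged weights before evaluation. The sketch below is a minimal illustration of that idea, not the authors' reference implementation; `model`, `train_loader`, `loss_fn`, the hyperparameter values and the averaging start epoch are placeholders.

```python
# Minimal sketch of the Gadam idea: Adam with decoupled weight decay (AdamW)
# combined with tail iterate averaging (IA). Assumes standard PyTorch utilities;
# all hyperparameter values below are illustrative placeholders.
import torch
from torch.optim import AdamW
from torch.optim.swa_utils import AveragedModel, update_bn

def train_gadam_sketch(model, train_loader, loss_fn, epochs=300,
                       lr=3e-4, weight_decay=0.25, avg_start_epoch=160):
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    averaged = AveragedModel(model)              # equal-weight running average of iterates
    for epoch in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
            if epoch >= avg_start_epoch:         # tail averaging: only average late iterates
                averaged.update_parameters(model)
    update_bn(train_loader, averaged)            # refresh batch-norm statistics for the averaged weights
    return averaged                              # evaluate with the averaged model
```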
