Iterate Averaging Helps: An Alternative Perspective in Deep Learning

Iterate averaging has a rich history in optimisation, but it has only recently been popularised in deep learning. We investigate its effects in a deep learning context and argue that previous explanations of its efficacy, which place great importance on the local geometry (flatness versus sharpness) of the final solutions, are not necessarily relevant. We instead argue that the key reasons for the performance gain are the robustness of iterate averaging to the typically very high estimation noise in deep learning and the various regularisation effects that averaging exerts; this effect is made even more prominent by the over-parameterisation of modern networks. Inspired by this, we propose Gadam, which combines Adam with iterate averaging to address a key problem of adaptive optimisers: they often generalise worse than SGD. Without compromising adaptivity, and with minimal additional computational burden, we show that Gadam (and its variant GadamX) achieves generalisation performance that is consistently superior to tuned SGD and on par with or better than SGD with iterate averaging on various image classification (CIFAR-10/100 and ImageNet 32$\times$32) and language modelling (PTB) tasks.
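The idea of combining Adam with iterate averaging can be illustrated with a short sketch: run a standard Adam training loop and, from some point onwards, maintain a running average of the weight iterates, which is then used at evaluation time. The code below is a minimal illustration of this general scheme under stated assumptions, not the authors' Gadam implementation; the function name, the `avg_start_step` parameter, and the use of plain Adam (rather than decoupled weight decay) are choices made for brevity.

```python
# Minimal sketch of iterate (weight) averaging on top of Adam.
# Illustrative only: `avg_start_step` and the overall structure are
# assumptions for this sketch, not the paper's exact Gadam algorithm.
import copy
import torch


def train_with_iterate_averaging(model, loss_fn, data_loader,
                                 num_steps=10_000, avg_start_step=5_000,
                                 lr=1e-3, weight_decay=1e-4):
    optimiser = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)
    avg_model = copy.deepcopy(model)  # holds the running average of iterates
    n_avg = 0

    step = 0
    while step < num_steps:
        for x, y in data_loader:
            optimiser.zero_grad()
            loss_fn(model(x), y).backward()
            optimiser.step()

            # Start averaging only in the later phase of training,
            # once the iterates oscillate around a good region.
            if step >= avg_start_step:
                n_avg += 1
                with torch.no_grad():
                    for p_avg, p in zip(avg_model.parameters(),
                                        model.parameters()):
                        p_avg += (p - p_avg) / n_avg  # incremental running mean
            step += 1
            if step >= num_steps:
                break

    # Return the averaged model if averaging has started, else the last iterate.
    return avg_model if n_avg > 0 else model
```

At test time the averaged weights, rather than the final iterate, are evaluated; if the network contains batch normalisation layers, their running statistics should be recomputed under the averaged weights before evaluation.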
