Dual Averaging is Surprisingly Effective for Deep Learning Optimization

First-order stochastic optimization methods are currently the most widely used class of methods for training deep neural networks. However, the choice of optimizer is often made by ad-hoc rules, and it can significantly affect performance. For instance, SGD with momentum (SGD+M) is typically used in computer vision (CV), while Adam is used for training transformer models in Natural Language Processing (NLP). Using the wrong method can lead to significant performance degradation. Inspired by the dual averaging algorithm, we propose Modernized Dual Averaging (MDA), an optimizer that performs as well as SGD+M in CV and as well as Adam in NLP. Our method is not adaptive and is significantly simpler than Adam. We show that, compared to vanilla SGD+M, MDA induces a decaying uncentered $L_2$-regularization, and we hypothesize that this may explain why it works on NLP problems where SGD+M fails.
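
The abstract does not spell out the MDA update, so the following is only a minimal sketch of classical (Nesterov-style) dual averaging written as a PyTorch optimizer, not the authors' MDA method (which additionally uses momentum and modern scheduling). The class name `DualAveraging`, the `lr` parameter, and the $\beta_k = \sqrt{k}$ schedule are illustrative assumptions. The closed-form iterate $x_{k+1} = x_0 - \beta_k^{-1} \sum_{i \le k} g_i$ makes the uncentered regularization visible: each iterate is shrunk toward the initial point $x_0$ rather than toward zero, and the shrinkage weakens as $\beta_k$ grows.

```python
# Minimal sketch of plain (Nesterov-style) dual averaging as a PyTorch optimizer.
# NOT the authors' MDA algorithm; names and the beta_k schedule are assumptions.
import torch
from torch.optim import Optimizer


class DualAveraging(Optimizer):
    """Vanilla dual averaging:
        x_{k+1} = x_0 - (lr / beta_k) * sum_{i<=k} g_i,
    i.e. the minimizer of the accumulated linear model plus a quadratic
    prox term centered at x_0, with beta_k = sqrt(k) (assumed schedule)."""

    def __init__(self, params, lr=1.0):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    state["x0"] = p.detach().clone()         # initial point x_0
                    state["grad_sum"] = torch.zeros_like(p)  # running sum of gradients
                    state["k"] = 0
                state["grad_sum"].add_(p.grad)
                state["k"] += 1
                beta_k = state["k"] ** 0.5
                # Closed-form iterate: shrink toward x_0, not toward zero
                # (the "uncentered" regularization referred to in the abstract).
                p.copy_(state["x0"] - (group["lr"] / beta_k) * state["grad_sum"])
```

Usage would follow the standard PyTorch loop: construct `DualAveraging(model.parameters(), lr=0.1)`, call `loss.backward()`, then `step()`; only the parameter update rule differs from SGD.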
