Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Adaptive gradient algorithms [1–4] borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy-ball acceleration in theory [5] and in many empirical cases [6], is much less investigated in the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to effectively speed up the training of deep neural networks. Adan first reformulates vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of evaluating the gradient at the extrapolation point. Adan then adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. Moreover, we prove that Adan finds an ε-approximate first-order stationary point within O(ε^{-3.5}) stochastic gradient complexity on nonconvex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on both CNNs and transformers, and sets new SoTAs for many popular networks and frameworks, e.g., ResNet [7], ConvNeXt [8], ViT [9], Swin [10], MAE [11], LSTM [12], Transformer-XL [13], and BERT [14]. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT.
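To make the NME idea concrete, below is a minimal NumPy sketch of an Adan-style update, written from the description above rather than from the authors' released implementation; the exact recursion forms, the hyperparameter names (lr, beta1, beta2, beta3, eps, weight_decay), their default values, and the decoupled-weight-decay step are assumptions for illustration only. The point it shows is that the Nesterov-style correction is built from the difference of two consecutive stochastic gradients, g_k − g_{k−1}, so no extra gradient evaluation at an extrapolation point is needed.

import numpy as np

def adan_like_step(theta, grad, prev_grad, state,
                   lr=1e-3, beta1=0.02, beta2=0.08, beta3=0.01,
                   eps=1e-8, weight_decay=0.0):
    """One illustrative Adan-style update on parameters `theta` (not the authors' code)."""
    m, v, n = state["m"], state["v"], state["n"]
    diff = grad - prev_grad                      # gradient difference stands in for the
                                                 # gradient at an extrapolation point
    m = (1 - beta1) * m + beta1 * grad           # first-order moment of the gradient
    v = (1 - beta2) * v + beta2 * diff           # moment of the gradient difference
    corrected = grad + (1 - beta2) * diff        # NME-corrected gradient
    n = (1 - beta3) * n + beta3 * corrected**2   # second-order moment of the corrected gradient
    step = lr * (m + (1 - beta2) * v) / (np.sqrt(n) + eps)
    theta = (theta - step) / (1 + lr * weight_decay)   # decoupled-weight-decay-style shrinkage
    state.update(m=m, v=v, n=n)
    return theta

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
theta = np.ones(4)
state = dict(m=np.zeros(4), v=np.zeros(4), n=np.zeros(4))
prev_grad = np.zeros(4)
for _ in range(500):
    grad = theta.copy()
    theta = adan_like_step(theta, grad, prev_grad, state, lr=0.01)
    prev_grad = grad
print(theta)  # entries should have moved from 1.0 to near zero (within roughly the learning rate)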

[1] M. Cord, et al. DeiT III: Revenge of the ViT, 2022, ECCV.

[2] Teck Khim Ng, et al. Mugs: A Multi-Granular Self-Supervised Learning Framework, 2022, ArXiv.

[3] Cho-Jui Hsieh, et al. Towards Efficient and Scalable Sharpness-Aware Minimization, 2022, CVPR.

[4] Ping Luo, et al. Context Autoencoder for Self-Supervised Representation Learning, 2022, ArXiv.

[5] Mingrui Liu, et al. Understanding AdamW through Proximal Methods and Scale-Freeness, 2022, Trans. Mach. Learn. Res.

[6] Zhouchen Lin, et al. Restarted Nonconvex Accelerated Gradient Descent: No More Polylogarithmic Factor in the O(ε^{-7/4}) Complexity, 2022.

[7] Trevor Darrell, et al. A ConvNet for the 2020s, 2022, CVPR.

[8] Shuicheng Yan, et al. MetaFormer is Actually What You Need for Vision, 2021, CVPR.

[9] Ross B. Girshick, et al. Masked Autoencoders Are Scalable Vision Learners, 2021, CVPR.

[10] Joey Tianyi Zhou, et al. Efficient Sharpness-aware Minimization for Improved Training of Neural Networks, 2021, ICLR.

[11] Cho-Jui Hsieh, et al. When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, 2021, ICLR.

[12] Xiaozhe Ren, et al. Large-Scale Deep Learning Optimizations: A Comprehensive Survey, 2021, ArXiv.

[13] Ross Wightman, et al. ResNet strikes back: An improved training procedure in timm, 2021, ArXiv.

[14] Jun Zhu, et al. Tianshou: a Highly Modularized Deep Reinforcement Learning Library, 2021, J. Mach. Learn. Res.

[15] Trevor Darrell, et al. Early Convolutions Help Transformers See Better, 2021, NeurIPS.

[16] Tianbao Yang, et al. A Novel Convergence Analysis for Algorithms of the Adam Family and Beyond, 2021, arXiv:2104.14840.

[17] Jungmin Kwon, et al. ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks, 2021, ICML.

[18] George E. Dahl, et al. A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes, 2021, ArXiv.

[19] Matthieu Cord, et al. Training data-efficient image transformers & distillation through attention, 2020, ICML.

[20] Marco Mondelli, et al. Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks, 2020, ICML.

[21] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[22] Ariel Kleiner, et al. Sharpness-Aware Minimization for Efficiently Improving Generalization, 2020, ICLR.

[23] Seong Joon Oh, et al. AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights, 2020, ICLR.

[24] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, ICCV.

[25] Shuicheng Yan, et al. Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond, 2021, NeurIPS.

[26] Yizhou Wang, et al. Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization, 2021, ArXiv.

[27] Mingyi Hong, et al. RMSprop converges with proper hyper-parameter, 2021, ICLR.

[28] Mingrui Liu, et al. Adam+: A Stochastic Method with Adaptive Variance Reduction, 2020, ArXiv.

[29] Sashank J. Reddi, et al. Why are Adaptive Methods Good for Attention Models?, 2020, NeurIPS.

[30] J. Duncan, et al. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients, 2020, NeurIPS.

[31] Yair Carmon, et al. Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations, 2020, COLT.

[32] Marco Mondelli, et al. Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology, 2020, NeurIPS.

[33] Jacob Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[34] Ashok Cutkosky, et al. Momentum Improves Normalized SGD, 2020, ICML.

[35] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.

[36] Quoc V. Le, et al. Randaugment: Practical automated data augmentation with a reduced search space, 2019, CVPRW.

[37] Liyuan Liu, et al. On the Variance of the Adaptive Learning Rate and Beyond, 2019, ICLR.

[38] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.

[39] Jinghui Chen, et al. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks, 2018, IJCAI.

[40] John C. Duchi, et al. Lower bounds for non-convex stochastic optimization, 2019, Mathematical Programming.

[41] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[42] Seong Joon Oh, et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features, 2019, ICCV.

[43] Myle Ott, et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.

[44] Xu Sun, et al. Adaptive Gradient Methods with Dynamic Bound of Learning Rate, 2019, ICLR.

[45] Zhouchen Lin, et al. Sharp Analysis for Nonconvex SGD Escaping from Saddle Points, 2019, COLT.

[46] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.

[47] Mingyi Hong, et al. On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, 2018, ICLR.

[48] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.

[49] Yuan Cao, et al. On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, 2018, ArXiv.

[50] Sashank J. Reddi, et al. On the Convergence of Adam and Beyond, 2018, ICLR.

[51] Michael I. Jordan, et al. Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent, 2017, COLT.

[52] Hongyi Zhang, et al. mixup: Beyond Empirical Risk Minimization, 2017, ICLR.

[53] Dimitris S. Papailiopoulos, et al. Stability and Generalization of Learning Algorithms that Converge to Global Optima, 2017, ICML.

[54] Sanjiv Kumar, et al. Adaptive Methods for Nonconvex Optimization, 2018, NeurIPS.

[55] Yi Zhou, et al. Characterization of Gradient Dominance and Regularity Conditions for Neural Networks, 2017, ArXiv.

[56] Yang You, et al. Large Batch Training of Convolutional Networks, 2017, arXiv:1708.03888.

[57] Yuanzhi Li, et al. Convergence Analysis of Two-layer Neural Networks with ReLU Activation, 2017, NIPS.

[58] Le Song, et al. Diverse Neural Network Learns True Target Functions, 2016, AISTATS.

[59] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.

[60] Pieter Abbeel, et al. Benchmarking Deep Reinforcement Learning for Continuous Control, 2016, ICML.

[61] Kilian Q. Weinberger, et al. Deep Networks with Stochastic Depth, 2016, ECCV.

[62] Timothy Dozat, et al. Incorporating Nesterov Momentum into Adam, 2016.

[63] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, CVPR.

[64] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[65] Dumitru Erhan, et al. Going deeper with convolutions, 2014, CVPR.

[66] Gerald Penn, et al. Convolutional Neural Networks for Speech Recognition, 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[67] Tara N. Sainath, et al. Improvements to Deep Convolutional Neural Networks for LVCSR, 2013, IEEE Workshop on Automatic Speech Recognition and Understanding.

[68] Yuval Tassa, et al. MuJoCo: A physics engine for model-based control, 2012, IEEE/RSJ International Conference on Intelligent Robots and Systems.

[69] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[70] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.

[71] H. Robbins. A Stochastic Approximation Method, 1951.

[72] Yurii Nesterov, et al. Introductory Lectures on Convex Optimization - A Basic Course, 2014, Applied Optimization.

[73] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[74] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[75] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.

[76] Boris Polyak. Some methods of speeding up the convergence of iteration methods, 1964.