PWPROP: A Progressive Weighted Adaptive Method for Training Deep Neural Networks

In recent years, adaptive optimization methods for deep learning have attracted considerable attention. The AMSGRAD analysis indicates that adaptive methods may fail to converge to the optimal solutions of some convex problems because their adaptive learning rates, as in ADAM, can diverge. However, we find that AMSGRAD may generalize worse than ADAM on some deep learning tasks. We first show that AMSGRAD may fail to find a flat minimum. How, then, can we design an optimization method that finds a flat minimum with low training loss? Few works focus on this important problem. We propose a novel progressive weighted adaptive optimization algorithm, called PWPROP, which has fewer hyperparameters than counterparts such as ADAM. By constructing an intuitive “sharp-flat minima” model, we show how different second-order estimates affect the ability to escape a sharp minimum. Moreover, we prove that PWPROP resolves the non-convergence issue of ADAM and achieves a sublinear convergence rate on non-convex problems. Extensive experimental results show that PWPROP is effective across a range of deep learning architectures such as the Transformer, and achieves state-of-the-art results.
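
As background for the non-convergence and sharp-minima discussion above, the sketch below contrasts the well-known second-moment rules of ADAM and AMSGRAD in plain Python. It is a minimal illustration of those baselines only, not of PWPROP's update rule, which is not given in this abstract; the function name `adaptive_step`, the scalar-parameter setting, and the omission of bias correction are simplifications assumed here for clarity.

```python
# Minimal sketch (scalar parameter, no bias correction) of the second-moment
# estimates contrasted above. ADAM's estimate v can shrink between steps, so
# the effective step size lr/sqrt(v) can grow, which underlies its
# non-convergence issue; AMSGRAD clamps the denominator to its running
# maximum, making the effective step size non-increasing. PWPROP's own
# estimate is not reproduced here.
import math

def adaptive_step(param, grad, m, v, v_max, lr=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8, amsgrad=False):
    """One ADAM-style update of a single scalar parameter."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate

    if amsgrad:
        v_max = max(v_max, v)                     # AMSGRAD: non-decreasing denominator
        denom = math.sqrt(v_max) + eps
    else:
        denom = math.sqrt(v) + eps                # ADAM: denominator may shrink

    param -= lr * m / denom
    return param, m, v, v_max
```

With `amsgrad=True` the denominator never decreases, which prevents the divergent effective learning rate but also makes the method slower to forget large past gradients; this trade-off is what the “sharp-flat minima” discussion above weighs when comparing different second-order estimates.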
