Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
[1] M. Cord, et al. DeiT III: Revenge of the ViT, 2022, ECCV.
[2] Teck Khim Ng, et al. Mugs: A Multi-Granular Self-Supervised Learning Framework, 2022, ArXiv.
[3] Cho-Jui Hsieh, et al. Towards Efficient and Scalable Sharpness-Aware Minimization, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Ping Luo, et al. Context Autoencoder for Self-Supervised Representation Learning, 2022, ArXiv.
[5] Mingrui Liu, et al. Understanding AdamW through Proximal Methods and Scale-Freeness, 2022, Trans. Mach. Learn. Res.
[6] Zhouchen Lin, et al. Restarted Nonconvex Accelerated Gradient Descent: No More Polylogarithmic Factor in the $O(\epsilon^{-7/4})$ Complexity, 2022.
[7] Trevor Darrell, et al. A ConvNet for the 2020s, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Shuicheng Yan, et al. MetaFormer is Actually What You Need for Vision, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Ross B. Girshick, et al. Masked Autoencoders Are Scalable Vision Learners, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Joey Tianyi Zhou, et al. Efficient Sharpness-aware Minimization for Improved Training of Neural Networks, 2021, ICLR.
[11] Cho-Jui Hsieh, et al. When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations, 2021, ICLR.
[12] Xiaozhe Ren, et al. Large-Scale Deep Learning Optimizations: A Comprehensive Survey, 2021, ArXiv.
[13] Ross Wightman, et al. ResNet strikes back: An improved training procedure in timm, 2021, ArXiv.
[14] Jun Zhu, et al. Tianshou: a Highly Modularized Deep Reinforcement Learning Library, 2021, J. Mach. Learn. Res.
[15] Trevor Darrell, et al. Early Convolutions Help Transformers See Better, 2021, NeurIPS.
[16] Tianbao Yang, et al. A Novel Convergence Analysis for Algorithms of the Adam Family and Beyond, 2021, arXiv:2104.14840.
[17] Jungmin Kwon, et al. ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks, 2021, ICML.
[18] George E. Dahl, et al. A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes, 2021, ArXiv.
[19] Matthieu Cord, et al. Training data-efficient image transformers & distillation through attention, 2020, ICML.
[20] Marco Mondelli, et al. Tight Bounds on the Smallest Eigenvalue of the Neural Tangent Kernel for Deep ReLU Networks, 2020, ICML.
[21] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[22] Ariel Kleiner, et al. Sharpness-Aware Minimization for Efficiently Improving Generalization, 2020, ICLR.
[23] Seong Joon Oh, et al. AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights, 2020, ICLR.
[24] Stephen Lin, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[25] Shuicheng Yan, et al. Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond, 2021, NeurIPS.
[26] Yizhou Wang, et al. Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization, 2021, ArXiv.
[27] Mingyi Hong, et al. RMSprop converges with proper hyper-parameter, 2021, ICLR.
[28] Mingrui Liu, et al. Adam+: A Stochastic Method with Adaptive Variance Reduction, 2020, ArXiv.
[29] Sashank J. Reddi, et al. Why are Adaptive Methods Good for Attention Models?, 2020, NeurIPS.
[30] J. Duncan, et al. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients, 2020, NeurIPS.
[31] Yair Carmon, et al. Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations, 2020, COLT.
[32] Marco Mondelli, et al. Global Convergence of Deep Networks with One Wide Layer Followed by Pyramidal Topology, 2020, NeurIPS.
[33] Jacob Devlin, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[34] Ashok Cutkosky, et al. Momentum Improves Normalized SGD, 2020, ICML.
[35] Lysandre Debut, et al. HuggingFace's Transformers: State-of-the-art Natural Language Processing, 2019, ArXiv.
[36] Quoc V. Le, et al. Randaugment: Practical automated data augmentation with a reduced search space, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[37] Liyuan Liu, et al. On the Variance of the Adaptive Learning Rate and Beyond, 2019, ICLR.
[38] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.
[39] Jinghui Chen, et al. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks, 2018, IJCAI.
[40] John C. Duchi, et al. Lower bounds for non-convex stochastic optimization, 2019, Mathematical Programming.
[41] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[42] Seong Joon Oh, et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[43] Myle Ott, et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.
[44] Xu Sun, et al. Adaptive Gradient Methods with Dynamic Bound of Learning Rate, 2019, ICLR.
[45] Zhouchen Lin, et al. Sharp Analysis for Nonconvex SGD Escaping from Saddle Points, 2019, COLT.
[46] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[47] Mingyi Hong, et al. On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, 2018, ICLR.
[48] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[49] Yuan Cao, et al. On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, 2018, ArXiv.
[50] Sashank J. Reddi, et al. On the Convergence of Adam and Beyond, 2018, ICLR.
[51] Michael I. Jordan, et al. Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent, 2017, COLT.
[52] Hongyi Zhang, et al. mixup: Beyond Empirical Risk Minimization, 2017, ICLR.
[53] Dimitris S. Papailiopoulos, et al. Stability and Generalization of Learning Algorithms that Converge to Global Optima, 2017, ICML.
[54] Sanjiv Kumar, et al. Adaptive Methods for Nonconvex Optimization, 2018, NeurIPS.
[55] Yi Zhou, et al. Characterization of Gradient Dominance and Regularity Conditions for Neural Networks, 2017, ArXiv.
[56] Yang You, et al. Large Batch Training of Convolutional Networks, 2017, arXiv:1708.03888.
[57] Yuanzhi Li, et al. Convergence Analysis of Two-layer Neural Networks with ReLU Activation, 2017, NIPS.
[58] Le Song, et al. Diverse Neural Network Learns True Target Functions, 2016, AISTATS.
[59] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.
[60] Pieter Abbeel, et al. Benchmarking Deep Reinforcement Learning for Continuous Control, 2016, ICML.
[61] Kilian Q. Weinberger, et al. Deep Networks with Stochastic Depth, 2016, ECCV.
[62] Timothy Dozat. Incorporating Nesterov Momentum into Adam, 2016.
[63] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[64] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[65] Dumitru Erhan, et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[66] Gerald Penn, et al. Convolutional Neural Networks for Speech Recognition, 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[67] Tara N. Sainath, et al. Improvements to Deep Convolutional Neural Networks for LVCSR, 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.
[68] Yuval Tassa, et al. MuJoCo: A physics engine for model-based control, 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.
[69] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[70] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.
[71] H. Robbins. A Stochastic Approximation Method, 1951.
[72] Yurii Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, 2014, Applied Optimization.
[73] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[74] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.
[75] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.
[76] Boris Polyak. Some methods of speeding up the convergence of iteration methods, 1964.