AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights
Seong Joon Oh | Gyuwan Kim | Dongyoon Han | Sangdoo Yun | Byeongho Heo | Youngjung Uh | Sanghyuk Chun | Jung-Woo Ha
[1] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[2] Suha Kwak, et al. Proxy Anchor Loss for Deep Metric Learning, 2020, CVPR.
[3] Jung-Woo Ha, et al. NSML: Meet the MLaaS platform with a real-world case study, 2018, ArXiv.
[4] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.
[5] Pete Warden, et al. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition, 2018, ArXiv.
[6] Quoc V. Le, et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, 2019, ICML.
[7] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.
[8] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[9] Kilian Q. Weinberger, et al. Deep Networks with Stochastic Depth, 2016, ECCV.
[10] Mark Sandler, et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks, 2018, CVPR.
[11] Dawn Song, et al. Natural Adversarial Examples, 2019, CVPR.
[12] Pietro Perona, et al. The Caltech-UCSD Birds-200-2011 Dataset, 2011.
[13] Xiaogang Wang, et al. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations, 2016, CVPR.
[14] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.
[15] Seong Joon Oh, et al. Learning De-biased Representations with Biased Representations, 2019, ICML.
[16] Junmo Kim, et al. Deep Pyramidal Residual Networks, 2016, CVPR.
[17] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.
[18] Minz Won, et al. Visualizing and Understanding Self-attention based Music Tagging, 2019, ArXiv.
[19] Dongyoon Han, et al. ReXNet: Diminishing Representational Bottleneck on Convolutional Neural Network, 2020, ArXiv.
[20] Xavier Serra, et al. Data-Driven Harmonic Filters for Audio Representation Learning, 2020, ICASSP.
[21] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, Int. J. Comput. Vis.
[22] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, CVPR.
[23] Tim Salimans, et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks, 2016, NIPS.
[24] Xavier Serra, et al. Evaluation of CNN-based Automatic Music Tagging Models, 2020, ArXiv.
[25] Seong Joon Oh, et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features, 2019, ICCV.
[26] Richard Socher, et al. Pointer Sentinel Mixture Models, 2016, ICLR.
[27] Minhyung Cho, et al. Riemannian approach to batch normalization, 2017, NIPS.
[28] Sanjeev Arora, et al. Theoretical Analysis of Auto Rate-Tuning by Batch Normalization, 2018, ICLR.
[29] Elad Hoffer, et al. Norm matters: efficient and accurate normalization schemes in deep networks, 2018, NeurIPS.
[30] Geoffrey E. Hinton, et al. Layer Normalization, 2016, ArXiv.
[31] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.
[32] Michael I. Mandel, et al. Evaluation of Algorithms Using Games: The Case of Music Tagging, 2009, ISMIR.
[33] Thomas Hofmann, et al. Exponential convergence rates for Batch Normalization: The power of length-direction decoupling in non-convex optimization, 2018, AISTATS.
[34] Xingyi Zhou, et al. Objects as Points, 2019, ArXiv.
[35] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[36] Yang You, et al. Large Batch Training of Convolutional Networks, 2017, ArXiv.
[37] Liyuan Liu, et al. On the Variance of the Adaptive Learning Rate and Beyond, 2019, ICLR.
[38] Guido Montúfar, et al. Optimization Theory for ReLU Neural Networks Trained with Normalization Layers, 2020, ICML.
[39] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.
[40] Wei Liu, et al. SSD: Single Shot MultiBox Detector, 2015, ECCV.
[41] James Demmel, et al. Reducing BERT Pre-Training Time from 3 Days to 76 Minutes, 2019, ArXiv.
[42] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[43] Jonathan Krause, et al. 3D Object Representations for Fine-Grained Categorization, 2013, ICCV Workshops.
[44] Matthias Bethge, et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, 2018, ICLR.
[45] Aleksander Madry, et al. Towards Deep Learning Models Resistant to Adversarial Attacks, 2017, ICLR.
[46] Andrea Vedaldi, et al. Instance Normalization: The Missing Ingredient for Fast Stylization, 2016, ArXiv.
[47] J. Duncan, et al. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients, 2020, NeurIPS.
[48] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes, 2019, ICLR.
[49] Kaiming He, et al. Group Normalization, 2018, ECCV.
[50] Aleksander Madry, et al. How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift), 2018, NeurIPS.
[51] Xavier Serra, et al. Toward Interpretable Music Tagging with Self-Attention, 2019, ArXiv.
[52] Yiming Yang, et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context, 2019, ACL.
[53] Quoc V. Le, et al. Randaugment: Practical automated data augmentation with a reduced search space, 2019, CVPR Workshops.
[54] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[55] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[56] Geoffrey E. Hinton, et al. On the importance of initialization and momentum in deep learning, 2013, ICML.
[57] James Philbin, et al. FaceNet: A unified embedding for face recognition and clustering, 2015, CVPR.
[58] Yang You, et al. Scaling SGD Batch Size to 32K for ImageNet Training, 2017, ArXiv.
[59] Aleksander Madry, et al. Robustness May Be at Odds with Accuracy, 2018, ICLR.
[60] Xavier Serra, et al. Automatic music tagging with Harmonic CNN, 2019.
[61] Ankit Shah, et al. DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System, 2017, DCASE.
[62] Geoffrey E. Hinton, et al. Lookahead Optimizer: k steps forward, 1 step back, 2019, NeurIPS.
[63] Quoc V. Le, et al. AutoAugment: Learning Augmentation Policies from Data, 2018, ArXiv.
[64] Frank Hutter, et al. SGDR: Stochastic Gradient Descent with Warm Restarts, 2016, ICLR.
[65] Guodong Zhang, et al. Three Mechanisms of Weight Decay Regularization, 2018, ICLR.
[66] Silvio Savarese, et al. Deep Metric Learning via Lifted Structured Feature Embedding, 2015, CVPR.
[67] Nikos Komodakis, et al. Wide Residual Networks, 2016, BMVC.
[68] H. H. Rosenbrock. An Automatic Method for Finding the Greatest or Least Value of a Function, 1960, Comput. J.
[69] Timothy Dozat. Incorporating Nesterov Momentum into Adam, 2016.
[70] Seong Joon Oh, et al. An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods, 2020, ArXiv.
[71] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.