PWPROP: A Progressive Weighted Adaptive Method for Training Deep Neural Networks