Yu Cheng, Zhe Gan, Jingjing Liu, Tom Goldstein, Furong Huang, Chen Zhu
[1] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[2] Xu Sun, et al. Adaptive Gradient Methods with Dynamic Bound of Learning Rate, 2019, ICLR.
[3] Richard Socher, et al. A Closer Look at Deep Learning Heuristics: Learning Rate Restarts, Warmup and Distillation, 2018, ICLR.
[4] Liyuan Liu, et al. On the Variance of the Adaptive Learning Rate and Beyond, 2019, ICLR.
[5] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[6] Samuel R. Bowman, et al. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, 2017, NAACL.
[7] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, ArXiv.
[8] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[9] Sanjiv Kumar, et al. On the Convergence of Adam and Beyond, 2018, ICLR.
[10] Yang You, et al. Scaling SGD Batch Size to 32K for ImageNet Training, 2017, ArXiv.
[11] Nitish Srivastava, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014, J. Mach. Learn. Res.
[12] Sashank J. Reddi, et al. Why ADAM Beats SGD for Attention Models, 2019, ArXiv.
[13] Quoc V. Le, et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition, 2019, INTERSPEECH.
[14] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.
[15] Hector J. Levesque, et al. The Winograd Schema Challenge, 2011, AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.
[16] Yu Cheng, et al. FreeLB: Enhanced Adversarial Training for Natural Language Understanding, 2020, ICLR.
[17] Tom Schaul, et al. No More Pesky Learning Rates, 2012, ICML.
[18] Joaquin Quiñonero Candela, et al. Machine Learning Challenges: Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, 2006, Lecture Notes in Computer Science.
[19] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[20] Mingyi Hong, et al. On the Convergence of a Class of Adam-Type Algorithms for Non-Convex Optimization, 2018, ICLR.
[21] Elad Hoffer, et al. Train Longer, Generalize Better: Closing the Generalization Gap in Large Batch Training of Neural Networks, 2017, NIPS.
[22] Chris Brockett, et al. Automatically Constructing a Corpus of Sentential Paraphrases, 2005, IJCNLP.
[23] Denis Yarats, et al. On the Adequacy of Untuned Warmup for Adaptive Optimization, 2019, AAAI.
[24] Marcin Junczys-Dowmunt, et al. Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation, 2018, EMNLP.
[25] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[26] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[27] Ido Dagan, et al. The Third PASCAL Recognizing Textual Entailment Challenge, 2007, ACL-PASCAL@ACL.
[28] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[29] Samuel R. Bowman, et al. Neural Network Acceptability Judgments, 2018, Transactions of the Association for Computational Linguistics.
[30] Jinghui Chen, et al. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks, 2018, IJCAI.
[31] Geoffrey E. Hinton, et al. Lookahead Optimizer: k Steps Forward, 1 Step Back, 2019, NeurIPS.
[32] Liu Ziyin, et al. LaProp: Separating Momentum and Adaptivity in Adam, 2020, arXiv:2002.04839.
[33] Guodong Zhang, et al. Which Algorithmic Choices Matter at Which Batch Sizes? Insights from a Noisy Quadratic Model, 2019, NeurIPS.
[34] Kamyar Azizzadenesheli, et al. signSGD: Compressed Optimisation for Non-Convex Problems, 2018, ICML.
[35] Sekhar Tatikonda, et al. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients, 2020, NeurIPS.
[36] Christopher Potts, et al. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank, 2013, EMNLP.
[37] Philipp Hennig, et al. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients, 2017, ICML.
[38] Frank Hutter, et al. Fixing Weight Decay Regularization in Adam, 2017, ArXiv.
[39] Kevin Gimpel, et al. Gaussian Error Linear Units (GELUs), 2016, ArXiv.
[40] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.
[41] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[42] Marcello Federico, et al. Report on the 11th IWSLT Evaluation Campaign, 2014, IWSLT.
[43] Sanjiv Kumar, et al. Adaptive Methods for Nonconvex Optimization, 2018, NeurIPS.
[44] Quoc V. Le, et al. Don't Decay the Learning Rate, Increase the Batch Size, 2017, ICLR.
[45] Andreas Veit, et al. Why Are Adaptive Methods Good for Attention Models?, 2020, NeurIPS.
[46] James Demmel, et al. Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes, 2019, ICLR.
[47] Myle Ott, et al. Scaling Neural Machine Translation, 2018, WMT.
[48] Renjie Liao, et al. Understanding Short-Horizon Bias in Stochastic Meta-Optimization, 2018, ICLR.
[49] Peter Clark, et al. The Seventh PASCAL Recognizing Textual Entailment Challenge, 2011, TAC.
[50] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.
[51] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[52] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.