Liyuan Liu | Xiaodong Liu | Jianfeng Gao | Weizhu Chen | Jiawei Han | Haoming Jiang | Pengcheng He
[1] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.
[2] Marcello Federico, et al. Report on the 11th IWSLT evaluation campaign, 2014, IWSLT.
[3] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016, CVPR.
[4] Timothy Dozat. Incorporating Nesterov Momentum into Adam, 2016.
[5] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.
[6] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[7] Richard Socher, et al. A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation, 2018, ICLR.
[8] Tengyu Ma, et al. Fixup Initialization: Residual Learning Without Normalization, 2019, ICLR.
[9] Weizhu Chen, et al. DSCOVR: Randomized Primal-Dual Block Coordinate Algorithms for Asynchronous Distributed Optimization, 2017, J. Mach. Learn. Res.
[10] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[11] Sanjiv Kumar, et al. On the Convergence of Adam and Beyond, 2018, ICLR.
[12] Frank Hutter, et al. Fixing Weight Decay Regularization in Adam, 2017, ArXiv.
[13] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[14] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, ArXiv.
[15] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[16] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[17] K. F. Gauss. Theoria combinationis observationum erroribus minimis obnoxiae, 1823.
[18] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[19] Myle Ott, et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.
[20] Iryna Gurevych, et al. Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks, 2017, ArXiv.
[21] Ondrej Bojar, et al. Training Tips for the Transformer Model, 2018, Prague Bull. Math. Linguistics.
[22] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.
[23] Marcello Federico, et al. Report on the 10th IWSLT evaluation campaign, 2013, IWSLT.
[24] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.
[25] Xu Sun, et al. Adaptive Gradient Methods with Dynamic Bound of Learning Rate, 2019, ICLR.
[26] Xiang Ren, et al. Efficient Contextualized Representation: Language Model Pruning for Sequence Labeling, 2018, EMNLP.
[27] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.
[28] Heng Ji, et al. Reliability-aware Dynamic Feature Composition for Name Tagging, 2019, ACL.
[29] Jinghui Chen, et al. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks, 2018, IJCAI.
[30] Razvan Pascanu, et al. Advances in optimizing recurrent networks, 2013, ICASSP.
[31] Brian McWilliams, et al. The Shattered Gradients Problem: If resnets are the answer, then what is the question?, 2017, ICML.
[32] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.
[33] Kamyar Azizzadenesheli, et al. signSGD: compressed optimisation for non-convex problems, 2018, ICML.