Jian Jiao, Hany Hassan, Jianfeng Gao, Simiao Zuo, Tuo Zhao, Xiaodong Liu, Ruofei Zhang, Young Jin Kim
[1] Jiawei Han, et al. Understanding the Difficulty of Training Transformers, 2020, EMNLP.
[2] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.
[3] Yann Dauphin, et al. Pay Less Attention with Lightweight and Dynamic Convolutions, 2019, ICLR.
[4] Armen Aghajanyan, et al. Better Fine-Tuning by Reducing Representational Collapse, 2020, ICLR.
[5] Guillaume Lample, et al. Cross-lingual Language Model Pretraining, 2019, NeurIPS.
[6] Naman Goyal, et al. BASE Layers: Simplifying Training of Large, Sparse Models, 2021, ICML.
[7] Quoc V. Le, et al. The Evolved Transformer, 2019, ICML.
[8] Liyuan Liu, et al. On the Variance of the Adaptive Learning Rate and Beyond, 2019, ICLR.
[9] Chang Zhou, et al. Exploring Sparse Expert Models and Beyond, 2021, ArXiv.
[10] Omer Levy, et al. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, 2019, ACL.
[11] Myle Ott, et al. Scaling Neural Machine Translation, 2018, WMT.
[12] Noam Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, ArXiv.
[13] Jianfeng Gao, et al. SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization, 2019, ACL.
[14] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2016, CVPR.
[15] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[16] Jason Weston, et al. Hash Layers For Large Sparse Models, 2021, NeurIPS.
[17] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.
[18] Tao Qin, et al. Depth Growing for Neural Machine Translation, 2019, ACL.
[19] Nitish Srivastava, et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014, J. Mach. Learn. Res.
[20] Matt Post, et al. A Call for Clarity in Reporting BLEU Scores, 2018, WMT.
[21] Myle Ott, et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.
[22] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[23] Oren Etzioni, et al. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, 2018, ArXiv.
[24] Alexandre Muzio, et al. Scalable and Efficient MoE Training for Multitask Multilingual Models, 2021, ArXiv.
[25] Marc'Aurelio Ranzato, et al. Mixture Models for Diverse Machine Translation: Tricks of the Trade, 2019, ICML.
[26] Jingbo Zhu, et al. Learning Deep Transformer Models for Machine Translation, 2019, ACL.
[27] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[28] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[29] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[30] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[31] Jianfeng Gao, et al. Very Deep Transformers for Neural Machine Translation, 2020, ArXiv.
[32] Jianfeng Gao, et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention, 2020, ICLR.
[33] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[34] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.