Lifeng Shang | Qun Liu | Yichun Yin | Xin Jiang | Yujia Qin | Fengyu Wang | Cheng Chen | Zhi Wang | Xiao Chen | Zhiyuan Liu