暂无分享,去创建一个
[1] Mubarak Shah,et al. Norm-Preservation: Why Residual Networks Can Become Extremely Deep? , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[2] Jian Sun,et al. Identity Mappings in Deep Residual Networks , 2016, ECCV.
[3] Xiaodong Liu,et al. Unified Language Model Pre-training for Natural Language Understanding and Generation , 2019, NeurIPS.
[4] Jiawei Han,et al. Understanding the Difficulty of Training Transformers , 2020, EMNLP.
[5] Alexei Baevski,et al. Adaptive Input Representations for Neural Language Modeling , 2018, ICLR.
[6] Mohammad Shoeybi,et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism , 2019, ArXiv.
[7] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .
[8] Jonathan Berant,et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge , 2019, NAACL.
[9] Jimmy J. Lin,et al. End-to-End Open-Domain Question Answering with BERTserini , 2019, NAACL.
[10] Quoc V. Le,et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism , 2018, ArXiv.
[11] Omer Levy,et al. SpanBERT: Improving Pre-training by Representing and Predicting Spans , 2019, TACL.
[12] Omer Levy,et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.
[13] Chang Zhou,et al. Cognitive Graph for Multi-Hop Reading Comprehension at Scale , 2019, ACL.
[14] Jaewoo Kang,et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining , 2019, Bioinform..
[15] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[16] Edouard Grave,et al. Reducing Transformer Depth on Demand with Structured Dropout , 2019, ICLR.
[17] Julian Salazar,et al. Transformers without Tears: Improving the Normalization of Self-Attention , 2019, ArXiv.
[18] Yoshua Bengio,et al. Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.
[19] James Demmel,et al. Reducing BERT Pre-Training Time from 3 Days to 76 Minutes , 2019, ArXiv.
[20] Di He,et al. Efficient Training of BERT by Progressively Stacking , 2019, ICML.
[21] Tie-Yan Liu,et al. On Layer Normalization in the Transformer Architecture , 2020, ICML.
[22] Geoffrey E. Hinton,et al. Layer Normalization , 2016, ArXiv.
[23] Jeffrey S. Vetter,et al. NVIDIA Tensor Core Programmability, Performance & Precision , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[24] Qun Liu,et al. TinyBERT: Distilling BERT for Natural Language Understanding , 2020, EMNLP.
[25] Barak A. Pearlmutter,et al. Automatic differentiation in machine learning: a survey , 2015, J. Mach. Learn. Res..
[26] Jason Weston,et al. Curriculum learning , 2009, ICML '09.
[27] Ilya Sutskever,et al. Generating Long Sequences with Sparse Transformers , 2019, ArXiv.
[28] Thomas Wolf,et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter , 2019, ArXiv.
[29] Hao Wu,et al. Mixed Precision Training , 2017, ICLR.
[30] Jorge Nocedal,et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , 2016, ICLR.
[31] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..
[32] Kevin Gimpel,et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations , 2019, ICLR.
[33] Jingbo Zhu,et al. Learning Deep Transformer Models for Machine Translation , 2019, ACL.
[34] Quoc V. Le,et al. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators , 2020, ICLR.
[35] René Vidal,et al. Curriculum Dropout , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[36] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[37] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[39] Dustin Tran,et al. Mesh-TensorFlow: Deep Learning for Supercomputers , 2018, NeurIPS.
[40] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[41] Yiming Yang,et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.
[42] Kilian Q. Weinberger,et al. Deep Networks with Stochastic Depth , 2016, ECCV.
[43] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[44] Jürgen Schmidhuber,et al. Highway and Residual Networks learn Unrolled Iterative Estimation , 2016, ICLR.