DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Ammar Ahmad Awan, Samyam Rajbhandari, Jeff Rasley, Reza Yazdani Aminabadi, Zhewei Yao, Conglong Li, Minjia Zhang, Yuxiong He
[1] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[2] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[3] Eunsol Choi, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, 2017, ACL.
[4] Yiming Yang, et al. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, ACL.
[5] Zornitsa Kozareva, et al. Efficient Large Scale Language Modeling with Mixtures of Experts, 2021, ArXiv.
[6] Olatunji Ruwase, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[7] Geoffrey E. Hinton, et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.
[8] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[9] Rob Fergus, et al. Visualizing and Understanding Convolutional Networks, 2013, ECCV.
[10] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[11] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[12] Nathanael Chambers, et al. A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories, 2016, ArXiv.
[13] Michael W. Mahoney, et al. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT, 2019, AAAI.
[14] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[15] Yoshua Bengio, et al. How transferable are features in deep neural networks?, 2014, NIPS.
[16] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, ArXiv.
[17] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.
[18] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.
[19] Yu Cheng, et al. Patient Knowledge Distillation for BERT Model Compression, 2019, EMNLP.
[20] Alexander M. Rush, et al. Block Pruning For Faster Transformers, 2021, EMNLP.
[21] Furu Wei, et al. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers, 2020, NeurIPS.
[22] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[23] Guokun Lai, et al. RACE: Large-scale ReAding Comprehension Dataset From Examinations, 2017, EMNLP.
[24] Kevin Gimpel, et al. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, 2019, ICLR.
[25] Kurt Keutzer, et al. HAWQ: Hessian AWare Quantization of Neural Networks With Mixed-Precision, 2019, IEEE/CVF International Conference on Computer Vision (ICCV).
[26] Sandro Pezzelle, et al. The LAMBADA dataset: Word prediction requiring a broad discourse context, 2016, ACL.
[27] Alexandre Muzio, et al. Scalable and Efficient MoE Training for Multitask Multilingual Models, 2021, ArXiv.
[28] Mohammad Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.
[29] Noam Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, ArXiv.
[30] Zhilin Yang, et al. FastMoE: A Fast Mixture-of-Expert Training System, 2021, ArXiv.
[31] Yejin Choi, et al. PIQA: Reasoning about Physical Commonsense in Natural Language, 2019, AAAI.
[32] Andrew Chou, et al. Semantic Parsing on Freebase from Question-Answer Pairs, 2013, EMNLP.
[33] Omer Levy, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2018, BlackboxNLP@EMNLP.
[34] Kurt Keutzer, et al. HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks, 2020, NeurIPS.
[35] Yiming Yang, et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding, 2019, NeurIPS.
[36] Andrew M. Dai, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, 2021, ArXiv.
[37] Jian Jiao, et al. Taming Sparsely Activated Transformer with Stochastic Experts, 2021, ArXiv.
[38] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[39] Reza Ebrahimpour, et al. Mixture of experts: a literature survey, 2014, Artificial Intelligence Review.
[40] Kaisheng Yao, et al. KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[41] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.
[42] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[43] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.
[44] An Yang, et al. M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining, 2021, ArXiv.