DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

As the training of giant dense models hits the limits of today's hardware availability and capability, Mixture-of-Experts (MoE) models have become one of the most promising model architectures due to their significant training cost reduction relative to a quality-equivalent dense model. Their training cost savings have been demonstrated for encoder-decoder models (prior works) and reach 5x for auto-regressive language models (this work, along with parallel explorations). However, due to their much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting their practical use. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution built into the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost than existing MoE inference solutions. DeepSpeed-MoE offers unprecedented scale and efficiency, serving massive MoE models with up to 4.5x faster and 9x cheaper inference than quality-equivalent dense models. We hope our innovations and systems open a promising path to a new direction in the large-model landscape: a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely practical.
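The training cost advantage described above comes from sparse expert activation: each token is routed by a learned gate to only one (or a few) expert feed-forward networks, so per-token compute stays close to a dense baseline while total parameter count grows with the number of experts. The following is a minimal top-1 gating sketch in plain PyTorch for illustration only; it is not DeepSpeed-MoE's implementation, and the class name (Top1MoELayer) and all layer sizes are illustrative assumptions.

    # Minimal sketch of a sparsely-gated MoE layer (top-1 routing).
    # Illustrative only; not DeepSpeed-MoE's actual implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top1MoELayer(nn.Module):
        """Routes each token to a single expert FFN chosen by a learned gate."""

        def __init__(self, hidden_size: int, ffn_size: int, num_experts: int):
            super().__init__()
            self.gate = nn.Linear(hidden_size, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(
                    nn.Linear(hidden_size, ffn_size),
                    nn.GELU(),
                    nn.Linear(ffn_size, hidden_size),
                )
                for _ in range(num_experts)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (tokens, hidden_size). Real systems shard experts across GPUs
            # and exchange tokens with all-to-all communication.
            probs = F.softmax(self.gate(x), dim=-1)   # gate scores per expert
            top_prob, top_idx = probs.max(dim=-1)     # top-1 expert per token
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = top_idx == e
                if mask.any():
                    # Only the tokens routed to expert e pass through it, so
                    # compute scales with tokens, not total parameter count.
                    out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
            return out

    # Toy usage: 8 experts, each token activates exactly one of them.
    layer = Top1MoELayer(hidden_size=512, ffn_size=2048, num_experts=8)
    tokens = torch.randn(16, 512)
    print(layer(tokens).shape)  # torch.Size([16, 512])

In this sketch the parameter count is roughly 8x that of a single dense FFN, yet each token only pays the compute of one expert, which is the basic trade-off behind the cost savings claimed in the abstract.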
