A Review of Sparse Expert Models in Deep Learning

Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in deep learning. This class of architectures encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by only a subset of the parameters. In this way, the degree of sparsity decouples the parameter count from the compute per example, allowing for extremely large yet efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
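
As a rough illustration of this decoupling (a minimal sketch, not code from any of the works surveyed), the snippet below implements a top-1 "switch"-style routing layer in NumPy: a learned router scores each token and only the single highest-scoring expert's parameters are applied, so per-token compute stays constant regardless of how many experts exist. All names and sizes here (router_w, d_ff, num_experts, and so on) are hypothetical choices for the example.

    # Minimal top-1 Mixture-of-Experts sketch (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, num_experts, num_tokens = 16, 32, 4, 8

    # Hypothetical parameters: one router matrix plus an independent FFN per expert.
    router_w = rng.normal(size=(d_model, num_experts))
    experts = [
        {"w_in": rng.normal(size=(d_model, d_ff)),
         "w_out": rng.normal(size=(d_ff, d_model))}
        for _ in range(num_experts)
    ]

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def moe_forward(tokens):
        """Route each token to its top-1 expert, scaled by the router probability."""
        gates = softmax(tokens @ router_w)      # (num_tokens, num_experts)
        choice = gates.argmax(axis=-1)          # index of the chosen expert per token
        out = np.zeros_like(tokens)
        for e, expert in enumerate(experts):
            mask = choice == e                  # tokens routed to expert e
            if not mask.any():
                continue
            h = np.maximum(tokens[mask] @ expert["w_in"], 0.0)   # ReLU FFN
            out[mask] = (h @ expert["w_out"]) * gates[mask, e:e + 1]
        return out

    tokens = rng.normal(size=(num_tokens, d_model))
    print(moe_forward(tokens).shape)  # (8, 16): each token used only 1 of the 4 experts

Adding more experts to such a layer grows the parameter count, but each token still passes through exactly one expert FFN, so the compute per example is unchanged; this is the property that makes sparse expert models attractive at scale.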
