A Review of Sparse Expert Models in Deep Learning
[1] Noah A. Smith, et al. Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models, 2022, ArXiv.
[2] Donald Metzler, et al. Confident Adaptive Language Modeling, 2022, ArXiv.
[3] Gerard de Melo, et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models, 2022, ArXiv.
[4] Changho Hwang, et al. Tutel: Adaptive Mixture-of-Experts at Scale, 2022, ArXiv.
[5] Rodolphe Jenatton, et al. Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts, 2022, NeurIPS.
[6] J. Ainslie, et al. Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT, 2022, EMNLP.
[7] A. Aiken, et al. Optimizing Mixture of Experts using Dynamic Recompilations, 2022, ArXiv.
[8] J. Dean, et al. Designing Effective Sparse Expert Models, 2022, 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[9] Lu Yuan, et al. Residual Mixture of Experts, 2022, ArXiv.
[10] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.
[11] Marc van Zee, et al. Scaling Up Models and Data with t5x and seqio, 2022, J. Mach. Learn. Res.
[12] Lisa Anne Hendricks, et al. Training Compute-Optimal Large Language Models, 2022, ArXiv.
[13] Andrew M. Dai, et al. Mixture-of-Experts with Expert Choice Routing, 2022, NeurIPS.
[14] A Survey on Dynamic Neural Networks for Natural Language Processing, 2022, ArXiv (2202.07101).
[15] Blake A. Hechtman, et al. Unified Scaling Laws for Routed Language Models, 2022, ICML.
[16] Reza Yazdani Aminabadi, et al. DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale, 2022, ICML.
[17] Quoc V. Le, et al. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, 2021, ICML.
[18] Diego de Las Casas, et al. Improving language models by retrieving from trillions of tokens, 2021, ICML.
[19] Dan Su, et al. SpeechMoE2: Mixture-of-Experts Model with Improved Routing, 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[20] Li Dong, et al. Swin Transformer V2: Scaling Up Capacity and Resolution, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Mike Lewis, et al. Tricks for Training Sparse Translation Models, 2021, NAACL.
[22] T. Zhao, et al. Taming Sparsely Activated Transformer with Stochastic Experts, 2021, ICLR.
[23] Noah A. Smith, et al. DEMix Layers: Disentangling Domains for Modular Language Modeling, 2021, NAACL.
[24] Yi Tay, et al. Efficient Transformers: A Survey, 2020, ACM Comput. Surv.
[25] Xupeng Miao, et al. Dense-to-Sparse Gate for Mixture-of-Experts, 2021, ArXiv.
[26] Robert Gmyr, et al. Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition, 2021, ArXiv.
[27] Aakanksha Chowdhery, et al. Sparse is Enough in Scaling Transformers, 2021, NeurIPS.
[28] Ankur Bapna, et al. Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference, 2021, EMNLP.
[29] Alexandre Muzio, et al. Scalable and Efficient MoE Training for Multitask Multilingual Models, 2021, ArXiv.
[30] Myle Ott, et al. On Anytime Learning at Macroscale, 2021, CoLLAs.
[31] Carlos Riquelme, et al. Scaling Vision with Sparse Mixture of Experts, 2021, NeurIPS.
[32] Jason Weston, et al. Hash Layers For Large Sparse Models, 2021, NeurIPS.
[33] Aakanksha Chowdhery, et al. DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning, 2021, NeurIPS.
[34] Chang Zhou, et al. Exploring Sparse Expert Models and Beyond, 2021, ArXiv.
[35] Dong Yu, et al. SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts, 2021, INTERSPEECH.
[36] A. Dosovitskiy, et al. MLP-Mixer: An all-MLP Architecture for Vision, 2021, NeurIPS.
[37] David R. So, et al. Carbon Emissions and Large Neural Network Training, 2021, ArXiv.
[38] Naman Goyal, et al. BASE Layers: Simplifying Training of Large, Sparse Models, 2021, ICML.
[39] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[40] Hyung Won Chung, et al. Do Transformer Modifications Transfer Across Implementations and Applications?, 2021, EMNLP.
[41] Noam M. Shazeer, et al. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity, 2021, J. Mach. Learn. Res.
[42] Colin Raffel, et al. Extracting Training Data from Large Language Models, 2020, USENIX Security Symposium.
[43] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[44] Holger Schwenk, et al. Beyond English-Centric Multilingual Machine Translation, 2020, J. Mach. Learn. Res.
[45] Joan Puigcerver, et al. Scalable Transfer Learning with Expert Models, 2020, ICLR.
[46] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[47] Zangwei Zheng, et al. Sparse-MLP: A Fully-MLP Architecture with Conditional Computation, 2021, ArXiv.
[48] Ming-Wei Chang, et al. Retrieval Augmented Language Model Pre-Training, 2020, ICML.
[49] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[50] Nicolas Usunier, et al. End-to-End Object Detection with Transformers, 2020, ECCV.
[51] Yu Zhang, et al. Conformer: Convolution-augmented Transformer for Speech Recognition, 2020, INTERSPEECH.
[52] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[53] P. S. Castro, et al. Rigging the Lottery: Making All Tickets Winners, 2019, ICML.
[54] Omer Levy, et al. Generalization through Memorization: Nearest Neighbor Language Models, 2019, ICLR.
[55] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, 2019, J. Mach. Learn. Res.
[56] Samyam Rajbhandari, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2019, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[57] Shinji Watanabe, et al. Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration, 2019, INTERSPEECH.
[58] Luke Zettlemoyer, et al. Sparse Networks from Scratch: Faster Training without Losing Performance, 2019, ArXiv.
[59] Wei Li, et al. Behavior sequence transformer for e-commerce recommendation in Alibaba, 2019, Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data.
[60] Omer Levy, et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, 2019, NeurIPS.
[61] Ignacio Cases, et al. Routing Networks and the Challenges of Modular and Compositional Computation, 2019, ArXiv.
[62] Erich Elsen, et al. The State of Sparsity in Deep Neural Networks, 2019, ArXiv.
[63] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.
[64] Quoc V. Le, et al. Diversity and Depth in Per-Example Routing Models, 2018, ICLR.
[65] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[66] David Barber, et al. Modular Networks: Learning to Decompose Neural Computation, 2018, NeurIPS.
[67] Dustin Tran, et al. Mesh-TensorFlow: Deep Learning for Supercomputers, 2018, NeurIPS.
[68] Zhe Zhao, et al. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts, 2018, KDD.
[69] Nikhil R. Devanur, et al. PipeDream: Fast and Efficient Pipeline Parallel DNN Training, 2018, ArXiv.
[70] Shuang Xu, et al. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition, 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[71] Matthew Riemer, et al. Routing Networks: Adaptive Selection of Non-linear Functions for Multi-Task Learning, 2017, ICLR.
[72] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[73] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[74] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.
[75] Tinne Tuytelaars, et al. Expert Gate: Lifelong Learning with a Network of Experts, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[76] Joelle Pineau, et al. Conditional Computation in Neural Networks for faster models, 2015, ArXiv.
[77] Milos Hauskrecht, et al. Obtaining Well Calibrated Probabilities Using Bayesian Binning, 2015, AAAI.
[78] Marc'Aurelio Ranzato, et al. Learning Factored Representations in a Deep Mixture of Experts, 2013, ICLR.
[79] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.
[80] Joseph N. Wilson, et al. Twenty Years of Mixture of Experts, 2012, IEEE Transactions on Neural Networks and Learning Systems.
[81] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, 2011, NIPS.
[82] Ronald J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.
[83] Yoshua Bengio, et al. Gradient-based learning applied to document recognition, 1998, Proc. IEEE.
[84] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[85] Robert A. Jacobs, et al. Hierarchical Mixtures of Experts and the EM Algorithm, 1993, Neural Computation.
[86] Geoffrey E. Hinton, et al. Adaptive Mixtures of Local Experts, 1991, Neural Computation.