Efficiently Scaling Transformer Inference
J. Dean | Anselm Levskaya | Jacob Devlin | James Bradbury | J. Heek | Aakanksha Chowdhery | Shivani Agrawal | Reiner Pope | Sholto Douglas | Kefan Xiao
[1] Blake A. Hechtman, et al. Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models, 2022, ASPLOS.
[2] J. Dean, et al. A Review of Sparse Expert Models in Deep Learning, 2022, ArXiv.
[3] M. Lewis, et al. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022, ArXiv.
[4] Reza Yazdani Aminabadi, et al. DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale, 2022, SC22: International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Daniel Y. Fu, et al. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, 2022, NeurIPS.
[6] Lawrence C. McAfee, et al. Reducing Activation Recomputation in Large Transformer Models, 2022, MLSys.
[7] Lisa Anne Hendricks, et al. Training Compute-Optimal Large Language Models, 2022, ArXiv.
[8] Joseph Gonzalez, et al. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning, 2022, OSDI.
[9] Reza Yazdani Aminabadi, et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model, 2022, ArXiv.
[10] Renelito Delos Santos, et al. LaMDA: Language Models for Dialog Applications, 2022, ArXiv.
[11] Manish Gupta, et al. Compression of Deep Learning Models for Text: A Survey, 2020, ACM Trans. Knowl. Discov. Data.
[12] Aakanksha Chowdhery, et al. Sparse is Enough in Scaling Transformers, 2021, NeurIPS.
[13] Ankur Bapna, et al. Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference, 2021, EMNLP.
[14] Noam Shazeer, et al. GSPMD: General and Scalable Parallelization for ML Computation Graphs, 2021, ArXiv.
[15] Oleg Rybakov, et al. Pareto-Optimal Quantized ResNet Is Mostly 4-bit, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[16] Zhijie Zhang, et al. Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch, 2021, ICLR.
[17] Lucy J. Colwell, et al. Rethinking Attention with Performers, 2020, ICLR.
[18] Orhan Firat, et al. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, 2020, ICLR.
[19] Aurko Roy, et al. Efficient Content-Based Sparse Attention with Routing Transformers, 2020, TACL.
[20] Ji Li, et al. Efficient Transformer-based Large Scale Language Representations using Hardware-friendly Block Structured Pruning, 2020, FINDINGS.
[21] Song Han, et al. HAT: Hardware-Aware Transformers for Efficient Natural Language Processing, 2020, ACL.
[22] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[23] Hermann Ney, et al. Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture, 2020, ACL.
[24] Yiming Yang, et al. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices, 2020, ACL.
[25] Dan Klein, et al. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers, 2020, ArXiv.
[26] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, ArXiv.
[27] Lukasz Kaiser, et al. Reformer: The Efficient Transformer, 2020, ICLR.
[28] Samyam Rajbhandari, et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2019, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.
[29] Noam Shazeer, et al. Fast Transformer Decoding: One Write-Head is All You Need, 2019, ArXiv.
[30] Moshe Wasserblat, et al. Q8BERT: Quantized 8Bit BERT, 2019, 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS).
[31] Thomas Wolf, et al. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2019, ArXiv.
[32] M. Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.
[33] Edouard Grave, et al. Adaptive Attention Span in Transformers, 2019, ACL.
[34] Ilya Sutskever, et al. Generating Long Sequences with Sparse Transformers, 2019, ArXiv.
[35] G. Hua, et al. LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks, 2018, ECCV.
[36] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.
[37] Robert A. van de Geijn, et al. Collective communication: theory, practice, and experience, 2007, Concurr. Comput. Pract. Exp.
[38] Karsten M. Decker, et al. Programming Environments for Massively Parallel Distributed Systems, 1994, Monte Verità.
[39] Rolf Hempel, et al. The MPI Message Passing Interface Standard, 1994.