Efficiently Scaling Transformer Inference

We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. A better understanding of the engineering tradeoffs of inference for large Transformer-based models is important as their use cases grow rapidly across application areas. We develop a simple analytical model of inference efficiency to select the best multi-dimensional partitioning techniques, optimized for TPU v4 slices, based on application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoff for 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that, with appropriate partitioning, the lower memory requirements of multiquery attention (i.e., multiple query heads share a single key/value head) enable scaling up to 32× longer context lengths. Finally, we achieve a low-batch-size latency of 29 ms per token during generation (using int8 weight quantization) and 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.
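
As the abstract notes, multiquery attention shares a single key/value head across all query heads, which shrinks the key/value cache that must be held in accelerator memory during decoding and thereby enables much longer contexts. The JAX sketch below is a minimal illustration of that idea only; the function name `multiquery_attention` and all shapes are hypothetical and are not taken from the paper's implementation or partitioning scheme.

```python
# Minimal sketch of multiquery attention: n_heads query heads attend over one
# shared key/value head, so the KV cache is 1/n_heads the size of standard
# multi-head attention. All names and dimensions here are illustrative.
import jax
import jax.numpy as jnp

def multiquery_attention(q, k, v):
    # q: [batch, n_heads, seq_q, d_head]  -- per-head queries
    # k: [batch, seq_k, d_head]           -- single shared key head
    # v: [batch, seq_k, d_head]           -- single shared value head
    logits = jnp.einsum('bhqd,bkd->bhqk', q, k) / jnp.sqrt(q.shape[-1])
    weights = jax.nn.softmax(logits, axis=-1)
    return jnp.einsum('bhqk,bkd->bhqd', weights, v)

# Toy usage: 8 query heads share one 128-dim key/value head.
key = jax.random.PRNGKey(0)
q = jax.random.normal(key, (1, 8, 16, 128))
k = jax.random.normal(key, (1, 16, 128))
v = jax.random.normal(key, (1, 16, 128))
out = multiquery_attention(q, k, v)  # shape [1, 8, 16, 128]
```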
