Efficiently Scaling Transformer Inference

We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. A better understanding of the engineering tradeoffs of inference for large Transformer-based models is important as their use cases grow rapidly across application areas. We develop a simple analytical model of inference efficiency to select the best multi-dimensional partitioning techniques, optimized for TPU v4 slices, based on application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoff for 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that, with appropriate partitioning, the lower memory requirements of multiquery attention (i.e., multiple query heads share a single key/value head) enable scaling up to 32× longer context lengths. Finally, we achieve a low-batch-size latency of 29 ms per token during generation (using int8 weight quantization) and 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.
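
As the abstract notes, multiquery attention shares a single key/value head across all query heads, which shrinks the key/value cache that must be held in accelerator memory during decoding and thereby enables much longer contexts. The JAX sketch below is a minimal illustration of that idea only; the function name `multiquery_attention` and all shapes are hypothetical and are not taken from the paper's implementation or partitioning scheme.

```python
# Minimal sketch of multiquery attention: n_heads query heads attend over one
# shared key/value head, so the KV cache is 1/n_heads the size of standard
# multi-head attention. All names and dimensions here are illustrative.
import jax
import jax.numpy as jnp

def multiquery_attention(q, k, v):
    # q: [batch, n_heads, seq_q, d_head]  -- per-head queries
    # k: [batch, seq_k, d_head]           -- single shared key head
    # v: [batch, seq_k, d_head]           -- single shared value head
    logits = jnp.einsum('bhqd,bkd->bhqk', q, k) / jnp.sqrt(q.shape[-1])
    weights = jax.nn.softmax(logits, axis=-1)
    return jnp.einsum('bhqk,bkd->bhqd', weights, v)

# Toy usage: 8 query heads share one 128-dim key/value head.
key = jax.random.PRNGKey(0)
q = jax.random.normal(key, (1, 8, 16, 128))
k = jax.random.normal(key, (1, 16, 128))
v = jax.random.normal(key, (1, 16, 128))
out = multiquery_attention(q, k, v)  # shape [1, 8, 16, 128]
```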
