SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification

The high computational and memory requirements of generative large language models (LLMs) make it challenging to serve them quickly and cheaply. This paper introduces SpecInfer, an LLM serving system that accelerates generative LLM inference with speculative inference and token tree verification. A key insight behind SpecInfer is to combine multiple collectively boost-tuned small language models to jointly predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses the LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirements of serving generative LLMs while provably preserving model quality.
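
To make the verification step concrete, below is a minimal Python sketch of tree-based verification under greedy decoding. The tree representation (`TreeNode`), the tree attention mask builder, and the `llm_argmax` callback are illustrative assumptions for this sketch, not SpecInfer's actual API; the paper's verifier also handles stochastic decoding, which is omitted here.

```python
# Minimal sketch of token tree verification (greedy decoding case).
# Assumes a hypothetical `llm_argmax(tokens, tree_mask)` callback that runs one
# forward pass of the LLM and returns, for every position, the token the LLM
# would emit next at that position.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TreeNode:
    token: int             # speculated token proposed by a small language model
    parent: Optional[int]  # index of the parent node in the tree, None for a root


def build_tree_mask(nodes: List[TreeNode]) -> List[List[bool]]:
    """Each speculated token may attend only to itself and its tree ancestors.

    The mask covers only the speculated suffix; prompt positions are assumed to
    remain causally visible to every speculated token.
    """
    mask = [[False] * len(nodes) for _ in nodes]
    for i, _ in enumerate(nodes):
        j: Optional[int] = i
        while j is not None:
            mask[i][j] = True
            j = nodes[j].parent
    return mask


def verify_token_tree(
    prompt: List[int],
    nodes: List[TreeNode],
    llm_argmax: Callable[[List[int], List[List[bool]]], List[int]],
) -> List[int]:
    """Verify all speculated branches in one LLM pass; return the accepted tokens."""
    tokens = prompt + [n.token for n in nodes]
    tree_mask = build_tree_mask(nodes)      # tree attention over the speculated suffix
    preds = llm_argmax(tokens, tree_mask)   # LLM's next-token choice at every position

    # Walk the tree from the root, descending into the child whose speculated
    # token matches what the LLM itself would have produced at that position.
    accepted: List[int] = []
    parent: Optional[int] = None            # None denotes "after the last prompt token"
    while True:
        pos = len(prompt) - 1 if parent is None else len(prompt) + parent
        expected = preds[pos]
        match = next(
            (i for i, n in enumerate(nodes) if n.parent == parent and n.token == expected),
            None,
        )
        if match is None:
            accepted.append(expected)       # the LLM's own token extends the output by one
            return accepted
        accepted.append(nodes[match].token)
        parent = match
```

Because every speculated branch is scored in the same forward pass, the accepted prefix is identical to what the LLM would have produced by decoding one token at a time, which is why verification preserves model quality while amortizing the cost of the large model over several output tokens.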
