SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification
Zhihao Jia, G. Oliaro, Xupeng Miao, Zhihao Zhang, Zhuoming Chen, Daiyaan Arfeen, Zeyu Wang, Xinhao Cheng, Rae Ying Yee Wong, Reyna Abhyankar