LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

Recent advances in self-supervised learning, combined with the Transformer architecture, have enabled natural language processing (NLP) models to reach remarkably low perplexity. However, these powerful models demand ever-increasing size, leading to substantial computational and memory requirements. In this paper, we introduce an efficient inference framework tailored for large-scale generative language models. To reduce the model size, we employ a weight-only quantization strategy while preserving full precision for activations, quantizing each weight to fewer than 4 bits with either non-uniform or uniform quantization techniques. Our proposed kernel, LUT-GEMM, then accelerates quantized matrix multiplications, offering a flexible trade-off between compression ratio and accuracy. Unlike earlier matrix multiplication kernels that support weight-only quantization, LUT-GEMM eliminates the resource-intensive dequantization step for both uniform and non-uniform quantization methods. By lowering latency on individual GPUs and across the end-to-end inference pipeline for large-scale language models, LUT-GEMM delivers significant inference speedups. Its impact stems from the high compression ratios achieved through low-bit quantization combined with efficient LUT-based operations, which in turn reduce the number of GPUs required. For the OPT-175B model with 3-bit quantization, we show that LUT-GEMM reduces the latency of generating each token by 2.1x compared to OPTQ, which relies on costly dequantization. Consequently, LUT-GEMM enables inference of OPT-175B on a single GPU without noticeable degradation in accuracy or performance, whereas the non-quantized OPT-175B model requires a minimum of 8 GPUs.
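The abstract describes the mechanism only at a high level. For intuition, the NumPy sketch below illustrates the general LUT-based idea behind binary-coding (non-uniform) quantization: activations are grouped into small sub-vectors, every possible signed sum within a group is precomputed once into a lookup table, and the packed binary weight patterns then index that table instead of being dequantized and multiplied. This is a minimal functional sketch, not the paper's CUDA kernel; the function names (build_luts, lut_bcq_matvec), the group size mu, and the bit-plane layout are assumptions chosen for clarity.

import numpy as np

def build_luts(x, mu=8):
    # For every group of `mu` activations in x, precompute all 2**mu possible
    # signed sums (each activation multiplied by +1 or -1). One table per group.
    num_groups = x.size // mu
    patterns = np.arange(2 ** mu)
    # signs[p, j] = +1 if bit j of pattern p is set, else -1
    signs = np.where((patterns[:, None] >> np.arange(mu)) & 1, 1.0, -1.0)
    return np.stack([signs @ x[g * mu:(g + 1) * mu] for g in range(num_groups)])

def lut_bcq_matvec(packed, alphas, luts):
    # y = sum_i alpha_i * (B_i @ x), where each row of the binary matrix B_i is
    # stored as packed mu-bit patterns; table lookups replace dequantize-and-multiply.
    q, m, num_groups = packed.shape
    y = np.zeros(m)
    for i in range(q):                       # one pass per bit plane
        partial = np.zeros(m)
        for g in range(num_groups):
            partial += luts[g, packed[i, :, g]]
        y += alphas[i] * partial
    return y

# Toy check against a dense binary-coding matrix-vector product.
rng = np.random.default_rng(0)
m, n, q, mu = 4, 16, 3, 8                    # rows, cols, bit planes, group size
x = rng.standard_normal(n)
B = rng.choice([-1.0, 1.0], size=(q, m, n))  # binary weight planes
alphas = rng.random(q)                       # per-plane scaling factors
# Pack every mu-wide slice of +/-1 weights into an integer index (bit j set <=> +1).
bits = (B > 0).reshape(q, m, n // mu, mu).astype(np.int64)
packed = (bits << np.arange(mu)).sum(axis=-1)
y_lut = lut_bcq_matvec(packed, alphas, build_luts(x, mu))
y_ref = sum(alphas[i] * (B[i] @ x) for i in range(q))
assert np.allclose(y_lut, y_ref)

The benefit of this formulation is that the same tables are reused for every output row, so the per-weight work shrinks to a table lookup and a scaled accumulation. The sketch is only a CPU analogue of that idea and makes no claim about how the paper's GPU kernel organizes the tables in practice.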

[1] Song Han et al. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models, 2022, ArXiv.

[2] Dan Alistarh et al. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2022, ArXiv.

[3] Kang Min Yoo et al. AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models, 2022, EMNLP.

[4] P. Zhang et al. GLM-130B: An Open Bilingual Pre-trained Model, 2022, ICLR.

[5] M. Lewis et al. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, 2022, ArXiv.

[6] Reza Yazdani Aminabadi et al. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers, 2022, NeurIPS.

[7] Xi Victoria Lin et al. OPT: Open Pre-trained Transformer Language Models, 2022, ArXiv.

[8] Andrew M. Dai et al. PaLM: Scaling Language Modeling with Pathways, 2022, J. Mach. Learn. Res.

[9] Lisa Anne Hendricks et al. Training Compute-Optimal Large Language Models, 2022, ArXiv.

[10] Laurent El Shafey et al. Pathways: Asynchronous Distributed Dataflow for ML, 2022, MLSys.

[11] Reza Yazdani Aminabadi et al. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model, 2022, ArXiv.

[12] Vahid Partovi Nia et al. Kronecker Decomposition for GPT Compression, 2021, ACL.

[13] Shiyu Xu et al. Multiplication Through a Single Look-Up-Table (LUT) in CNN Inference Computation, 2021, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[14] Po-Sen Huang et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher, 2021, ArXiv.

[15] Peisong Wang et al. Towards Mixed-Precision Quantization of Neural Networks via Constrained Optimization, 2021, IEEE/CVF International Conference on Computer Vision (ICCV).

[16] Kyungduk Kim et al. What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers, 2021, EMNLP.

[17] Olatunji Ruwase et al. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning, 2021, SC21: International Conference for High Performance Computing, Networking, Storage and Analysis.

[18] Amar Phanishayee et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM, 2021, SC21: International Conference for High Performance Computing, Networking, Storage and Analysis.

[19] Kurt Keutzer et al. I-BERT: Integer-only BERT Quantization, 2021, ICML.

[20] Yoonjung Choi et al. Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation, 2020, Findings of EMNLP.

[21] Abdel-rahman Mohamed et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations, 2020, NeurIPS.

[22] Mark Chen et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[23] Dongsoo Lee et al. BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.

[24] Rana Ali Amjad et al. Up or Down? Adaptive Rounding for Post-Training Quantization, 2020, ICML.

[25] Geoffrey E. Hinton et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.

[26] Xu Liu et al. Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect, 2019, IEEE Transactions on Parallel and Distributed Systems.

[27] M. Shoeybi et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, ArXiv.

[28] Kushal Datta et al. Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model, 2019, ArXiv.

[29] Erich Elsen et al. The State of Sparsity in Deep Neural Networks, 2019, ArXiv.

[30] Zhiru Zhang et al. Improving Neural Network Quantization without Retraining using Outlier Channel Splitting, 2019, ICML.

[31] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[32] Tianshi Chen et al. Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach, 2018, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[33] Yang Li et al. GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking, 2018, NeurIPS.

[34] Jeffrey S. Vetter et al. NVIDIA Tensor Core Programmability, Performance & Precision, 2018, IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[35] Dan Alistarh et al. Model compression via distillation and quantization, 2018, ICLR.

[36] Mark D. McDonnell et al. Training wide residual networks for deployment using a single bit for each weight, 2018, ICLR.

[37] Shuang Wu et al. Training and Inference with Integers in Deep Neural Networks, 2018, ICLR.

[38] Hongbin Zha et al. Alternating Multi-bit Quantization for Recurrent Neural Networks, 2018, ICLR.

[39] Bo Chen et al. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference, 2017, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Suyog Gupta et al. To prune, or not to prune: exploring the efficacy of pruning for model compression, 2017, ICLR.

[41] Dhabaleswar K. Panda et al. Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?, 2017, EuroMPI.

[42] Scott A. Mahlke et al. Scalpel: Customizing DNN pruning to the underlying hardware parallelism, 2017, ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[43] Lukasz Kaiser et al. Attention is All you Need, 2017, NIPS.

[44] Yurong Chen et al. Network Sketching: Exploiting Binary Structure in Deep CNNs, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Jungwon Lee et al. Towards the Limit of Network Quantization, 2016, ICLR.

[46] Ali Farhadi et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, 2016, ECCV.

[47] Sachin S. Talathi et al. Fixed Point Quantization of Deep Convolutional Networks, 2015, ICML.

[48] Song Han et al. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, 2015, ICLR.

[49] Mathias Beike et al. Digital Integrated Circuits: A Design Perspective, 2016.

[50] Saurabh Gupta et al. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[51] Geoffrey E. Hinton et al. Distilling the Knowledge in a Neural Network, 2015, ArXiv.

[52] Ebru Arisoy et al. Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing.

[53] Pramod Kumar Meher et al. LUT Optimization for Memory-Based Computation, 2010, IEEE Transactions on Circuits and Systems II: Express Briefs.

[54] Michael Garland et al. Efficient Sparse Matrix-Vector Multiplication on CUDA, 2008.

[55] Ricardo L. de Queiroz et al. LUT filters for quantized processing of signals, 2004, IEEE Transactions on Signal Processing.