LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
Se Jung Kwon | Dongsoo Lee | Byeongwook Kim | Jeonghoon Kim | Gunho Park | Baeseong Park | Youngjoo Lee | Sungjae Lee | Minsub Kim | Beomseok Kwon