AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models

There is growing interest in adapting large-scale language models through parameter-efficient fine-tuning methods. However, accelerating the model itself and achieving better inference efficiency through model compression have not been thoroughly explored yet. Model compression offers the benefits of reducing memory footprints, enabling low-precision computations, and ultimately achieving cost-effective inference. To combine parameter-efficient adaptation with model compression, we propose AlphaTuning, which consists of post-training quantization of the pre-trained language model and fine-tuning of only some parts of the quantized parameters for a target task. Specifically, AlphaTuning employs binary-coding quantization, which factorizes the full-precision parameters into binary values and a separate set of scaling factors. During the adaptation phase, the binary values are frozen for all tasks, while the scaling factors are fine-tuned for the downstream task. We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving a >10x compression ratio under 4-bit quantization and a >1,000x reduction in the number of trainable parameters.
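To make the factorization concrete, below is a minimal sketch (not the paper's implementation) of greedy binary-coding quantization and of a linear layer that keeps the binary codes frozen while exposing only the per-row scaling factors as trainable parameters. The module and argument names (`BCQLinear`, `bcq_quantize`, `num_bits`) are illustrative assumptions.

```python
# Sketch of binary-coding quantization (BCQ) and scaling-factor-only adaptation:
#   W  ~=  sum_i alpha_i * B_i,   B_i in {-1, +1},  alpha_i per output row.
# Names and structure are assumptions for illustration, not the authors' code.

import torch
import torch.nn as nn


def bcq_quantize(weight: torch.Tensor, num_bits: int = 4):
    """Greedy binary-coding quantization of a 2-D weight matrix."""
    residual = weight.clone()
    alphas, binaries = [], []
    for _ in range(num_bits):
        b = torch.sign(residual)
        b[b == 0] = 1.0                                   # avoid zero codes
        a = (residual * b).mean(dim=1, keepdim=True)      # per-row scaling factor
        alphas.append(a)
        binaries.append(b)
        residual = residual - a * b
    # shapes: (rows, num_bits) and (num_bits, rows, cols)
    return torch.cat(alphas, dim=1), torch.stack(binaries)


class BCQLinear(nn.Module):
    """Linear layer whose weight is reconstructed from frozen binary codes
    and trainable per-row scaling factors."""

    def __init__(self, linear: nn.Linear, num_bits: int = 4):
        super().__init__()
        alphas, binaries = bcq_quantize(linear.weight.data, num_bits)
        self.alphas = nn.Parameter(alphas)                # fine-tuned per downstream task
        self.register_buffer("binaries", binaries)        # frozen for all tasks
        self.bias = linear.bias

    def forward(self, x):
        # Reconstruct the weight on the fly: sum_i alpha_i * B_i
        w = (self.alphas.t().unsqueeze(-1) * self.binaries).sum(dim=0)
        return nn.functional.linear(x, w, self.bias)
```

In this setup, only `alphas` (and optionally biases) receive gradients during adaptation, which is what drives the large reduction in trainable parameters relative to full fine-tuning.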
