LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

The low-rank adaptation (LoRA) method can greatly reduce the number of trainable parameters for fine-tuning large language models (LLMs); however, it still requires expensive activation memory to update the low-rank weights. Reducing the number of LoRA layers or using activation recomputation can harm fine-tuning performance or increase computational overhead. In this work, we present LoRA-FA, a memory-efficient fine-tuning method that reduces activation memory without performance degradation or expensive recomputation. LoRA-FA freezes the projection-down weight $A$ and updates only the projection-up weight $B$ in each LoRA layer. This ensures that the change of the model weight resides in a low-rank space during LLM fine-tuning, while eliminating the need to store full-rank input activations. We conduct extensive experiments across multiple model types (RoBERTa, T5, LLaMA) and model scales. Our results show that LoRA-FA consistently achieves fine-tuning accuracy close to that of full-parameter fine-tuning and LoRA across different tasks. Furthermore, LoRA-FA reduces the overall memory cost by up to 1.4$\times$ compared to LoRA.

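To make the update rule described above concrete, the following PyTorch-style sketch shows one possible LoRA-FA linear layer. It is an illustrative assumption rather than the paper's reference implementation: the class name `LoRAFALinear`, the initialization scale, and the `rank`/`alpha` hyperparameters are placeholders chosen for the example.

```python
import torch
import torch.nn as nn


class LoRAFALinear(nn.Module):
    """Sketch of a LoRA-FA layer: the pretrained weight W and the
    projection-down weight A are frozen; only the projection-up weight B
    is trained, so the weight update B @ A stays in a rank-r subspace."""

    def __init__(self, in_features: int, out_features: int,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Pretrained weight W (frozen, as in standard LoRA).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)

        # Projection-down weight A: randomly initialized, then frozen (LoRA-FA).
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.02,
                                   requires_grad=False)
        # Projection-up weight B: initialized to zero and trained.
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Because A is frozen, the gradient of B only needs the low-rank
        # activation x @ A^T (r values per token), so the full-rank input x
        # does not have to be stored for the adapter update.
        down = x @ self.lora_A.t()                       # shape (..., rank)
        return self.base(x) + (down @ self.lora_B.t()) * self.scaling
```

With this layout, building the optimizer over only the parameters with `requires_grad=True` updates the $B$ matrices alone, which is what keeps the adapter's optimizer state and activation footprint small in this sketch.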