LoRAMoE: Revolutionizing Mixture of Experts for Maintaining World Knowledge in Language Model Alignment

Supervised fine-tuning (SFT) is a crucial step for large language models (LLMs), enabling them to align with human instructions and enhancing their capabilities in downstream tasks. When models must align with a broader range of downstream tasks, or when performance on a specific task needs a notable boost, substantially increasing the fine-tuning data is often the go-to solution. However, we find that large-scale increases in instruction data can disrupt the world knowledge previously stored in LLMs, i.e., world knowledge forgetting. In this paper, we introduce LoRAMoE to address this challenge. LoRAMoE is a plugin-style Mixture of Experts (MoE); the plugin form preserves world knowledge by freezing the backbone model during training. We then propose localized balancing constraints that dedicate part of the experts to downstream tasks while enabling the remaining experts to fully leverage the world knowledge stored in the model. Experimental results demonstrate that LoRAMoE reasonably coordinates experts according to data type during inference, and that even dramatically increasing the instruction data does not cause knowledge forgetting. Moreover, LoRAMoE yields additional gains on downstream tasks, indicating the potential of our approach for multi-task learning.
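
To make the architecture concrete, below is a minimal PyTorch sketch of what a LoRA-based MoE plugin layer could look like: a frozen base linear layer, several rank-r LoRA experts, and a learned linear router that mixes their outputs. All module names, the softmax router, and the hyperparameters (number of experts, rank, scaling) are illustrative assumptions rather than the paper's exact formulation, and the localized balancing constraint (an additional loss on the router outputs) is omitted here.

```python
# Illustrative sketch only: a frozen linear layer augmented with router-weighted
# LoRA experts. Names and hyperparameters are assumptions, not the paper's spec.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: delta_W = B @ A, scaled by alpha / rank."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(out_dim, rank))        # up-projection, init to zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(F.linear(x, self.A), self.B) * self.scale


class LoRAMoELayer(nn.Module):
    """Frozen base linear layer plus a router-weighted mixture of LoRA experts."""

    def __init__(self, base_linear: nn.Linear, num_experts: int = 6, rank: int = 8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():  # freeze the backbone to keep world knowledge intact
            p.requires_grad = False
        in_dim, out_dim = base_linear.in_features, base_linear.out_features
        self.experts = nn.ModuleList(
            [LoRAExpert(in_dim, out_dim, rank) for _ in range(num_experts)]
        )
        self.router = nn.Linear(in_dim, num_experts)  # only experts and router are trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)       # (..., num_experts)
        out = self.base(x)                             # frozen backbone path
        for i, expert in enumerate(self.experts):
            out = out + gate[..., i : i + 1] * expert(x)  # add weighted expert updates
        return out
```

In this sketch, only the router and the LoRA experts receive gradients, so fine-tuning on large instruction datasets leaves the pretrained weights (and the world knowledge they encode) untouched; a balancing-style loss over the gate values would then steer some experts toward task data and others toward knowledge-intensive inputs.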
