Improving Domain Adaptation through Extended-Text Reading Comprehension

To enhance the domain-specific capabilities of large language models, continued pre-training on a domain-specific corpus is a prevalent method. Recent work demonstrates that adapting models with reading comprehension data constructed from regex-based patterns can significantly improve performance on domain-specific tasks. However, regex-based patterns cannot parse raw corpora using domain-specific knowledge, and the question-answer pairs extracted directly from the corpus in predefined formats offer only limited context. To address these limitations, we improve reading comprehension via an LLM and clustering: the LLM leverages domain knowledge within the corpus to refine the comprehension stage, while clustering supplies relevant knowledge by extending the context to enrich the reading stage. Additionally, our method incorporates parameter-efficient fine-tuning to improve the efficiency of domain adaptation. Compared to AdaptLLM, our method achieves an improvement exceeding 5% on domain-specific tasks. Our code will be available at https://github.com/microsoft/LMOps.
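To make the clustering-for-extended-context idea concrete, the following is a minimal sketch, not the paper's implementation: it groups domain documents by lexical similarity and concatenates each group into a longer reading context, which could then be passed to a reading-comprehension formatting step. The TF-IDF representation, the cluster count, and the documents-per-context limit are illustrative assumptions rather than settings from the paper.

```python
# Illustrative sketch: build extended reading contexts by clustering
# related domain documents and concatenating each cluster's members.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def build_extended_contexts(docs, n_clusters=8, max_docs_per_context=3):
    """Group similar documents and join each group into one long context."""
    # Represent documents with TF-IDF vectors (an assumption; domain-tuned
    # embeddings could be substituted here).
    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # Cluster the documents so that topically related text ends up together.
    labels = KMeans(n_clusters=n_clusters, n_init="auto",
                    random_state=0).fit_predict(vectors)

    clusters = {}
    for doc, label in zip(docs, labels):
        clusters.setdefault(label, []).append(doc)

    contexts = []
    for members in clusters.values():
        # Concatenate a few related documents so each reading-comprehension
        # example sees more relevant domain context than a single snippet.
        contexts.append("\n\n".join(members[:max_docs_per_context]))
    return contexts
```

In practice, the concatenated context would also need to be truncated to the model's maximum sequence length before the comprehension-stage question-answer pairs are appended.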
