Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering

Large Language Models (LLMs) have gained popularity and achieved remarkable results on open-domain tasks, but their performance in real industrial, domain-specific scenarios is often mediocre because the required specialized knowledge is absent from their training data. This issue has attracted widespread attention, yet few relevant benchmarks are available. In this paper, we provide a benchmark Question Answering (QA) dataset named MSQA, covering Microsoft products and IT technical problems encountered by customers. The dataset contains industry cloud-specific QA knowledge that is unavailable to general-purpose LLMs, making it well suited for evaluating methods that aim to improve the domain-specific capabilities of LLMs. In addition, we propose a new model interaction paradigm that enables an LLM to perform better on domain-specific tasks where it is not proficient. Extensive experiments demonstrate that the approach following our model fusion framework outperforms the commonly used retrieval-augmented LLM methods.
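
To make the comparison concrete, below is a minimal sketch of the kind of "LLM with retrieval" baseline the abstract refers to: retrieve domain passages with BM25, then prepend them to the question before querying a general-purpose LLM. This is an illustrative assumption, not the paper's actual pipeline; the helper names (retrieve_context, build_prompt), the toy corpus, and the choice of the rank_bm25 library are all hypothetical.

```python
# Hypothetical retrieval-augmented QA baseline (not the paper's method):
# BM25 lexical retrieval over a small domain corpus, followed by prompt
# construction for a general-purpose LLM.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy stand-in for a domain knowledge base (e.g., Microsoft product docs).
corpus = [
    "Azure VM sizes determine CPU, memory, and disk limits.",
    "Use az vm resize to change the size of an existing virtual machine.",
    "Blob storage tiers include hot, cool, and archive.",
]
# BM25Okapi expects a tokenized corpus; whitespace tokenization suffices here.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve_context(question: str, k: int = 2) -> list[str]:
    """Return the k passages most relevant to the question under BM25."""
    return bm25.get_top_n(question.lower().split(), corpus, n=k)

def build_prompt(question: str) -> str:
    """Compose a retrieval-augmented prompt for a general-purpose LLM."""
    context = "\n".join(retrieve_context(question))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# The resulting prompt would then be sent to any chat-completion API;
# the LLM call itself is deliberately left abstract here.
print(build_prompt("How do I resize an Azure virtual machine?"))
```

The paper's claim is that its model fusion framework beats this style of baseline on MSQA; the sketch only fixes what the baseline looks like, under the stated assumptions.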
