Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering

Large Language Models (LLMs) have gained popularity and achieved remarkable results on open-domain tasks, but their performance in real industrial, domain-specific scenarios is often mediocre because the required specialized knowledge is absent from their training data. This issue has attracted widespread attention, yet few relevant benchmarks are available. In this paper, we provide a benchmark Question Answering (QA) dataset named MSQA, covering Microsoft products and IT technical problems encountered by customers. The dataset contains industry cloud-specific QA knowledge that is unavailable to general-purpose LLMs, making it well suited for evaluating methods that aim to improve the domain-specific capabilities of LLMs. In addition, we propose a new model interaction paradigm that enables an LLM to perform better on domain-specific tasks where it is not proficient. Extensive experiments demonstrate that the approach following our model fusion framework outperforms the commonly used retrieval-augmented LLM methods.
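
To make the comparison concrete, below is a minimal sketch of the kind of "LLM with retrieval" baseline the abstract refers to: retrieve domain passages with BM25, then prepend them to the question before querying a general-purpose LLM. This is an illustrative assumption, not the paper's actual pipeline; the helper names (retrieve_context, build_prompt), the toy corpus, and the choice of the rank_bm25 library are all hypothetical.

```python
# Hypothetical retrieval-augmented QA baseline (not the paper's method):
# BM25 lexical retrieval over a small domain corpus, followed by prompt
# construction for a general-purpose LLM.
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Toy stand-in for a domain knowledge base (e.g., Microsoft product docs).
corpus = [
    "Azure VM sizes determine CPU, memory, and disk limits.",
    "Use az vm resize to change the size of an existing virtual machine.",
    "Blob storage tiers include hot, cool, and archive.",
]
# BM25Okapi expects a tokenized corpus; whitespace tokenization suffices here.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

def retrieve_context(question: str, k: int = 2) -> list[str]:
    """Return the k passages most relevant to the question under BM25."""
    return bm25.get_top_n(question.lower().split(), corpus, n=k)

def build_prompt(question: str) -> str:
    """Compose a retrieval-augmented prompt for a general-purpose LLM."""
    context = "\n".join(retrieve_context(question))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# The resulting prompt would then be sent to any chat-completion API;
# the LLM call itself is deliberately left abstract here.
print(build_prompt("How do I resize an Azure virtual machine?"))
```

The paper's claim is that its model fusion framework beats this style of baseline on MSQA; the sketch only fixes what the baseline looks like, under the stated assumptions.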
