Clinical Camel: An Open-Source Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding

Large Language Models (LLMs) hold immense potential for medicine, yet concerns over data privacy, regulatory compliance, and model stability restrict their widespread adoption. Although distilling high-performing closed-source LLMs has proven effective for general tasks, the resulting models are of limited use in healthcare: they retain less domain knowledge, and remnants of the teacher's alignment behavior can interfere with clinical tasks. To address these challenges, we propose Dialogue-Based Knowledge Encoding (DBKE). By transforming dense academic source text into synthetic dialogue, DBKE broadens a model's implicit knowledge base, primes it for conversational recall, and imposes a soft alignment that guides downstream behavior. We present Clinical Camel, an open-source, healthcare-focused conversational model, to showcase the effectiveness of DBKE. Clinical Camel outperforms GPT-3.5 on the United States Medical Licensing Examination (USMLE) Step 1 and Step 3, scoring 53.2% and 58.2% versus GPT-3.5's 36.1% and 55.7%. Clinical Camel adeptly handles multi-stage clinical case problems, provides adaptive counseling, and generates clinical notes. However, it remains prone to hallucinations, a significant obstacle in safety-critical settings. Clinical Camel's performance underscores the importance of continued research on open-source models for the safe and effective integration of LLMs in healthcare settings.
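
The abstract describes DBKE only at a high level. The sketch below illustrates one plausible shape of such a pipeline, assuming a teacher LLM exposed as a plain text-in/text-out callable. The prompt template, the `Patient:`/`Clinician:` turn format, the alignment rules, and all function names here are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of Dialogue-Based Knowledge Encoding (DBKE): dense source
# text is rewritten by a teacher LLM into a synthetic multi-turn dialogue,
# and the resulting dialogues are used to fine-tune the student model.
# All prompt wording and names below are hypothetical.
import json
from typing import Callable, Dict, List

# Hypothetical alignment constraints folded into the rewrite prompt. The
# abstract calls this a "soft alignment" because the desired behavior is
# carried by the training dialogues rather than enforced by a separate
# reward model or hard filter.
ALIGNMENT_RULES = (
    "The clinician must answer truthfully, acknowledge uncertainty when "
    "unsure, and recommend consulting a physician for medical decisions."
)

DBKE_PROMPT_TEMPLATE = (
    "Rewrite the following medical text as a multi-turn dialogue between a "
    "patient and a clinician. Preserve all factual content. Obey these "
    f"rules in the clinician's turns: {ALIGNMENT_RULES}\n\nText:\n{{text}}"
)


def encode_passage(teacher: Callable[[str], str], passage: str) -> List[Dict[str, str]]:
    """Ask the teacher model for a synthetic dialogue and parse it into
    (role, content) turns suitable for chat-style instruction tuning."""
    raw = teacher(DBKE_PROMPT_TEMPLATE.format(text=passage))
    turns: List[Dict[str, str]] = []
    for line in raw.splitlines():
        if line.startswith("Patient:"):
            turns.append({"role": "user", "content": line[len("Patient:"):].strip()})
        elif line.startswith("Clinician:"):
            turns.append({"role": "assistant", "content": line[len("Clinician:"):].strip()})
    return turns


def build_dataset(teacher: Callable[[str], str], passages: List[str], out_path: str) -> None:
    """Write one JSONL record per source passage, ready for fine-tuning."""
    with open(out_path, "w") as f:
        for passage in passages:
            dialogue = encode_passage(teacher, passage)
            if dialogue:  # skip passages the teacher failed to convert
                f.write(json.dumps({"messages": dialogue}) + "\n")
```

In this framing the alignment constraints travel with the data: every training dialogue demonstrates the desired behavior, which is why the result is a soft alignment that shapes downstream conduct rather than a hard constraint imposed at inference time.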
