Towards Expert-Level Medical Question Answering with Large Language Models

Recent artificial intelligence (AI) systems have reached milestones in "grand challenges" ranging from Go to protein folding. The ability to retrieve medical knowledge, reason over it, and answer medical questions comparably to physicians has long been viewed as one such grand challenge. Large language models (LLMs) have catalyzed significant progress in medical question answering; Med-PaLM was the first model to exceed a "passing" score on US Medical Licensing Examination (USMLE)-style questions, scoring 67.2% on the MedQA dataset. However, this and other prior work suggested significant room for improvement, especially when model answers were compared with clinicians' answers. Here we present Med-PaLM 2, which bridges these gaps by combining base LLM improvements (PaLM 2), medical domain finetuning, and prompting strategies including a novel ensemble refinement approach. Med-PaLM 2 scored up to 86.5% on the MedQA dataset, improving upon Med-PaLM by over 19% and setting a new state of the art. We also observed performance approaching or exceeding the state of the art across the MedMCQA, PubMedQA, and MMLU clinical-topics datasets. We performed detailed human evaluations of long-form answers along multiple axes relevant to clinical applications. In a pairwise comparative ranking of 1066 consumer medical questions, physicians preferred Med-PaLM 2 answers to those produced by physicians on eight of nine axes pertaining to clinical utility (p < 0.001). We also observed significant improvements over Med-PaLM on every evaluation axis (p < 0.001) on newly introduced datasets of 240 long-form "adversarial" questions designed to probe LLM limitations. While further studies are necessary to validate the efficacy of these models in real-world settings, these results highlight rapid progress towards physician-level performance in medical question answering.
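The abstract names an "ensemble refinement" prompting strategy but does not describe its mechanics here. As a rough, hypothetical sketch of such a two-stage scheme (the `generate` callable, sample counts, and prompt wording below are illustrative assumptions, not the paper's released implementation): the model first samples several chain-of-thought drafts, is then re-prompted with its own drafts to produce refined answers, and a final answer is chosen by plurality vote in the spirit of self-consistency decoding.

```python
from collections import Counter
from typing import Callable, List

def ensemble_refinement(
    question: str,
    generate: Callable[[str, float], str],  # (prompt, temperature) -> completion; placeholder LLM API
    n_reasoning_paths: int = 11,            # illustrative defaults, not the paper's settings
    n_refinements: int = 33,
) -> str:
    # Stage 1: sample several chain-of-thought explanations at non-zero
    # temperature so the reasoning paths differ from one another.
    cot_prompt = f"Question: {question}\nExplain your reasoning step by step, then give a final answer."
    drafts: List[str] = [generate(cot_prompt, 0.7) for _ in range(n_reasoning_paths)]

    # Stage 2: condition the model on the question plus its own drafts and
    # ask it to reconcile them into a single refined answer; repeat several times.
    joined = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    refine_prompt = (
        f"Question: {question}\n\nCandidate reasoning paths:\n{joined}\n\n"
        "Considering the drafts above, give the single best final answer."
    )
    refined = [generate(refine_prompt, 0.7) for _ in range(n_refinements)]

    # Aggregate the refined answers by plurality vote.
    return Counter(a.strip() for a in refined).most_common(1)[0][0]
```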
