Almanac: Retrieval-Augmented Language Models for Clinical Medicine

Large language models have recently demonstrated impressive zero-shot capabilities in a variety of natural language tasks such as summarization, dialogue generation, and question answering. Despite many promising applications in clinical medicine, adoption of these models in real-world settings has been largely limited by their tendency to generate incorrect and sometimes even toxic statements. In this study, we develop Almanac, a large language model framework augmented with retrieval capabilities for medical guideline and treatment recommendations. Performance on a novel dataset of clinical scenarios (n = 130), evaluated by a panel of five board-certified and resident physicians, demonstrates significant increases in factuality (a mean improvement of 18%, p < 0.05) across all specialties, with improvements in completeness and safety. Our results highlight the potential for large language models to be effective tools in the clinical decision-making process, while also emphasizing the importance of careful testing and deployment to mitigate their shortcomings.
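The abstract does not specify Almanac's implementation, but the retrieval-augmented pattern it describes follows a common shape: embed the clinical query, retrieve the most relevant guideline passages, and condition the language model's answer on that retrieved context. The Python sketch below is a minimal illustration of that pattern only; the toy bag-of-words embedding, the sample passages, and all function names are assumptions for illustration, not Almanac's actual retriever or model.

```python
import numpy as np

# Toy corpus standing in for a store of guideline passages (illustrative only).
PASSAGES = [
    "For suspected pulmonary embolism, begin anticoagulation per risk stratification.",
    "Community-acquired pneumonia in healthy adults may be treated with amoxicillin.",
    "First-line pharmacotherapy for type 2 diabetes is typically metformin.",
]

def build_vocab(texts):
    """Map every token in the corpus to an index for bag-of-words vectors."""
    tokens = {t for text in texts for t in text.lower().split()}
    return {t: i for i, t in enumerate(sorted(tokens))}

def embed(text, vocab):
    """Bag-of-words embedding; a real system would use a learned text encoder."""
    vec = np.zeros(len(vocab))
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec

def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query by cosine similarity."""
    vocab = build_vocab(list(passages) + [query])
    q = embed(query, vocab)
    scores = []
    for p in passages:
        v = embed(p, vocab)
        denom = np.linalg.norm(q) * np.linalg.norm(v)
        scores.append(float(v @ q) / denom if denom else 0.0)
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

def build_prompt(query, context):
    """Ground the model's answer by restricting it to retrieved guideline text."""
    context_block = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the clinical question using only the context below.\n"
        f"Context:\n{context_block}\n"
        f"Question: {query}\nAnswer:"
    )

if __name__ == "__main__":
    query = "What is first-line treatment for type 2 diabetes?"
    prompt = build_prompt(query, retrieve(query, PASSAGES))
    print(prompt)  # In a full system this prompt would be sent to an LLM.
```

The key design point the sketch captures is that the model answers from retrieved source material rather than from parametric memory alone, which is the mechanism the abstract credits for the observed gains in factuality.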
