A study of generative large language model for medical research and healthcare

There is enormous enthusiasm, as well as concern, about using large language models (LLMs) in healthcare, yet current assumptions are all based on general-purpose LLMs such as ChatGPT. This study develops a clinical generative LLM, GatorTronGPT, trained on 277 billion words of mixed clinical and English text using a GPT-3 architecture with 20 billion parameters. GatorTronGPT improves biomedical natural language processing for medical research. Synthetic NLP models trained on GatorTronGPT-generated text outperform NLP models trained on real-world clinical text. A physicians' Turing test using a 1 (worst) to 9 (best) scale shows no significant difference in linguistic readability (p = 0.22; 6.57 for GatorTronGPT versus 6.93 for human) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT versus 6.97 for human), and that physicians cannot differentiate the two (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.
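The abstract reports p-values comparing physician ratings of GatorTronGPT-generated and human-written notes, but does not state which statistical test produced them. The snippet below is a minimal, hypothetical sketch assuming a two-sided Mann-Whitney U test on the 1 (worst) to 9 (best) ratings; the rating values are invented for illustration and the paper's actual analysis may differ.

```python
# Hedged sketch: compare physician ratings of GatorTronGPT vs. human notes.
# The test choice and the rating data are assumptions for illustration only.
from scipy import stats

# Hypothetical 1 (worst) to 9 (best) readability ratings from physician reviewers.
ratings_gatortrongpt = [7, 6, 7, 6, 7, 6, 7, 7, 6, 7]
ratings_human = [7, 7, 6, 7, 8, 7, 7, 6, 7, 7]

# Two-sided Mann-Whitney U test: is there a significant difference between groups?
u_stat, p_value = stats.mannwhitneyu(
    ratings_gatortrongpt, ratings_human, alternative="two-sided"
)
print(f"U = {u_stat:.1f}, p = {p_value:.3f}")  # p > 0.05 suggests no significant difference
```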
