Large Language Models Vote: Prompting for Rare Disease Identification

The emergence of generative Large Language Models (LLMs) emphasizes the need for accurate and efficient prompting approaches. LLMs are often applied in Few-Shot Learning (FSL) contexts, where tasks are executed with minimal training data. FSL has become popular in many Artificial Intelligence (AI) subdomains, including AI for health. Because rare diseases affect only a small fraction of the population, identifying them from clinical notes inherently requires FSL techniques due to limited data availability, and manual data collection and annotation are both expensive and time-consuming. In this paper, we propose Models-Vote Prompting (MVP), a flexible prompting approach for improving the performance of LLM queries in FSL settings. MVP works by prompting numerous LLMs to perform the same task and then conducting a majority vote on the resulting outputs. This method outperforms any single model in the ensemble on one-shot rare disease identification and classification tasks. We also release a novel rare disease dataset for FSL, available to those who have signed the MIMIC-IV Data Use Agreement (DUA). Furthermore, because MVP prompts each model multiple times, it substantially increases the time that manual annotation would require; to address this, we assess the feasibility of using JSON-formatted outputs to automate generative LLM evaluation.
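
To make the voting and JSON-based evaluation steps concrete, the sketch below shows one plausible way to implement MVP in Python. The `models_vote` helper, the `{"label": ...}` response schema, and the stand-in model callables are illustrative assumptions for this sketch, not the authors' released code.

```python
import json
from collections import Counter
from typing import Callable, List, Optional

def models_vote(prompt: str, model_fns: List[Callable[[str], str]]) -> Optional[str]:
    """Send the same prompt to every model and majority-vote the parsed labels.

    Each callable stands in for one LLM in the ensemble and is assumed to
    return a JSON string such as '{"label": "..."}' (a hypothetical schema).
    """
    labels = []
    for ask in model_fns:
        raw = ask(prompt)
        try:
            # Structured (JSON) output makes responses machine-readable,
            # so evaluation does not require manual reading of free text.
            labels.append(json.loads(raw)["label"])
        except (json.JSONDecodeError, KeyError, TypeError):
            # Malformed responses are simply excluded from the vote.
            continue
    if not labels:
        return None
    # Majority vote: the most frequent label across the ensemble wins.
    return Counter(labels).most_common(1)[0][0]

if __name__ == "__main__":
    # Toy stand-ins for real LLM calls; in practice each would query a
    # different model (e.g., different open LLMs) with the same one-shot prompt.
    ensemble = [
        lambda p: '{"label": "rare disease present"}',
        lambda p: '{"label": "rare disease present"}',
        lambda p: '{"label": "no rare disease"}',
    ]
    prompt = "Does the following clinical note mention a rare disease? ..."
    print(models_vote(prompt, ensemble))  # -> "rare disease present"
```

In this sketch, requiring JSON-formatted answers is what allows the vote to be tallied automatically; any free-text or malformed generation simply drops out of the count rather than requiring a human annotator to interpret it.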
