Towards Generalist Biomedical AI

Medicine is inherently multimodal, with rich data modalities spanning text, imaging, genomics, and more. Generalist biomedical artificial intelligence (AI) systems that flexibly encode, integrate, and interpret this data at scale can potentially enable impactful applications ranging from scientific discovery to care delivery. To enable the development of these models, we first curate MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses 14 diverse tasks such as medical question answering, mammography and dermatology image interpretation, radiology report generation and summarization, and genomic variant calling. We then introduce Med-PaLM Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI system. Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same set of model weights. Med-PaLM M reaches performance competitive with or exceeding the state of the art on all MultiMedBench tasks, often surpassing specialist models by a wide margin. We also report examples of zero-shot generalization to novel medical concepts and tasks, positive transfer learning across tasks, and emergent zero-shot medical reasoning. To further probe the capabilities and limitations of Med-PaLM M, we conduct a radiologist evaluation of model-generated (and human) chest X-ray reports and observe encouraging performance across model scales. In a side-by-side ranking on 246 retrospective chest X-rays, clinicians express a pairwise preference for Med-PaLM M reports over those produced by radiologists in up to 40.50% of cases, suggesting potential clinical utility. While considerable work is needed to validate these models in real-world use cases, our results represent a milestone towards the development of generalist biomedical AI systems.
