PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

In this paper, we focus on the problem of Medical Visual Question Answering (MedVQA), which is crucial for efficiently interpreting medical images that carry vital clinically relevant information. Firstly, we reframe MedVQA as a generation task that naturally follows human-machine interaction, and propose a generative model for medical visual understanding that aligns visual information from a pre-trained vision encoder with a large language model. Secondly, we establish a scalable pipeline to construct a large-scale medical visual question-answering dataset, named PMC-VQA, which contains 227k VQA pairs of 149k images covering various modalities and diseases. Thirdly, we pre-train our proposed model on PMC-VQA and then fine-tune it on multiple public benchmarks, e.g., VQA-RAD and SLAKE, outperforming existing work by a large margin. Additionally, we propose a manually verified test set that is significantly more challenging; even the best models struggle to solve it.
