Medical Diagnosis with Large Scale Multimodal Transformers: Leveraging Diverse Data for More Accurate Diagnosis

Multimodal deep learning has been used to predict clinical endpoints and diagnoses from routine clinical data. However, these models scale poorly: they must learn pairwise interactions between every piece of information in every data type, so model complexity quickly grows beyond manageable limits. This has so far precluded widespread use of multimodal deep learning. Here, we present a new technical approach, “learnable synergies”, in which the model selects only the relevant interactions between data modalities and keeps an “internal memory” of relevant data. Our approach scales easily and adapts naturally to the multimodal data inputs found in clinical routine. We demonstrate it on three large multimodal datasets from radiology and ophthalmology and show that it outperforms state-of-the-art models in clinically relevant diagnosis tasks. The approach is transferable and will allow multimodal deep learning to be applied to a broad set of clinically relevant problems.
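The scaling argument can be made concrete with a rough count of attention scores. The sketch below is illustrative only and is not the authors' implementation: it assumes that the "internal memory" behaves like a fixed-size set of latent vectors that attends to all inputs and then to itself, and the function names, token counts, and the latent size of 64 are hypothetical.

```python
def pairwise_interactions(token_counts):
    """Attention scores when every token attends to every other token."""
    total = sum(token_counts)
    return total * total


def latent_interactions(token_counts, num_latents):
    """Attention scores when a fixed set of latent ("memory") vectors
    cross-attends to all inputs and then self-attends (an assumption,
    not the paper's stated architecture)."""
    total = sum(token_counts)
    return num_latents * total + num_latents * num_latents


# Hypothetical token counts per modality: image patches, lab values, vital signs.
tokens = [1024, 64, 32]
full = pairwise_interactions(tokens)
bottleneck = latent_interactions(tokens, num_latents=64)
print(full, bottleneck)
```

Under these assumptions the full pairwise count grows quadratically with the total number of input tokens, whereas the latent count grows only linearly for a fixed memory size.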
