Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing

Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities. Prior work in biomedical VLP has mostly relied on the alignment of single image and report pairs, even though clinical notes commonly refer to prior images. This not only introduces poor alignment between the modalities but also misses the opportunity to exploit rich self-supervision from the temporal content already present in the data. In this work, we explicitly account for prior images and reports, when available, during both training and fine-tuning. Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model. It is designed to handle challenges that arise across time, such as pose variations and missing input images. The resulting model excels on downstream tasks in both single- and multi-image setups, achieving state-of-the-art performance on (I) progression classification, (II) phrase grounding, and (III) report generation, whilst offering consistent improvements on disease classification and sentence-similarity tasks. We release a novel multi-modal temporal benchmark dataset, MS-CXR-T, to quantify the quality of vision-language representations in terms of temporal semantics. Our experimental results show the advantages of incorporating prior images and reports to make the most of the available data.
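To make the multi-image encoding idea concrete, the sketch below shows one plausible shape for a CNN-Transformer hybrid encoder over a current image and an optional prior image. It is a minimal illustration under assumed choices (a ResNet-50 trunk, a learned placeholder token for a missing prior, the class name MultiImageEncoder, and all dimensions), not the released BioViL-T implementation.

```python
# A minimal sketch (not the released BioViL-T code): a CNN trunk extracts
# patch features per image, and a shallow transformer fuses current and prior
# tokens; a learned placeholder stands in when the prior study is missing.
from typing import Optional

import torch
import torch.nn as nn
from torchvision.models import resnet50


class MultiImageEncoder(nn.Module):
    """Hypothetical CNN-Transformer hybrid over a current and an optional prior image."""

    def __init__(self, embed_dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep only the convolutional trunk; drop global pooling and the classifier.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.project = nn.Conv2d(2048, embed_dim, kernel_size=1)
        # Learned embeddings mark tokens as "current" vs. "prior"; a separate
        # placeholder token is used when no prior image is available.
        self.time_embed = nn.Parameter(torch.zeros(2, 1, embed_dim))
        self.missing_prior = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)

    def _tokens(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.project(self.cnn(image))      # (B, D, H', W')
        return feats.flatten(2).transpose(1, 2)    # (B, H'*W', D) patch tokens

    def forward(self, current: torch.Tensor, prior: Optional[torch.Tensor] = None) -> torch.Tensor:
        cur_tokens = self._tokens(current) + self.time_embed[0]
        if prior is not None:
            prior_tokens = self._tokens(prior) + self.time_embed[1]
        else:
            # Single-image case: substitute a learned token so the fusion
            # transformer sees a consistent input structure.
            prior_tokens = self.missing_prior.expand(current.size(0), 1, -1)
        return self.fusion(torch.cat([cur_tokens, prior_tokens], dim=1))


# Usage: a pair of chest X-ray studies, or a single study with no prior.
encoder = MultiImageEncoder()
current, prior = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
fused_pair = encoder(current, prior)   # temporal, multi-image setup
fused_single = encoder(current)        # prior image missing
```

Keeping a convolutional trunk for local features and a shallow transformer for cross-time fusion mirrors the hybrid design named in the abstract; the placeholder token is one simple way to let the same model process studies that lack a prior image.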
