Contrastive Cross-Modal Pre-Training: A General Strategy for Small Sample Medical Imaging.

A key challenge in training neural networks for medical imaging tasks is obtaining a sufficient number of manually labeled examples. In contrast, textual imaging reports, which are often readily available in medical records, contain rich but unstructured interpretations written by experts as part of standard clinical practice. We propose using these textual reports as a form of weak supervision to improve the image interpretation performance of a neural network without requiring additional manually labeled examples. We train a feature extractor on an image-text matching task and then fine-tune it, in a transfer learning setting, on a supervised task using a small labeled dataset. The result is a neural network that interprets imagery automatically and does not require textual reports at inference time. This approach can be applied to any task for which text-image pairs are readily available. We evaluate our method on three classification tasks and find consistent performance improvements, reducing the need for labeled data by 67% to 98%.
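To make the image-text matching objective concrete, the following is a minimal sketch of one common contrastive formulation: a symmetric softmax loss over cosine similarities within a batch. It is an illustration under stated assumptions, not necessarily the loss used in this work. The function names, the temperature value, and the toy embeddings are all hypothetical, and in practice the embeddings would come from an image encoder and a text encoder rather than from random numbers.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_on_diagonal(sim):
    """Mean of -log softmax(row i)[i]; the matched pair sits on the diagonal."""
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)


def image_text_matching_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss; row i of each input is one image-report pair.

    The temperature of 0.1 is an illustrative assumption, not a tuned value.
    """
    imgs = [normalize(v) for v in image_emb]
    txts = [normalize(v) for v in text_emb]
    # Cosine similarity between every image and every report in the batch.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # report-to-image direction
    return 0.5 * (cross_entropy_on_diagonal(sim) + cross_entropy_on_diagonal(sim_t))


# Toy data standing in for encoder outputs (hypothetical, for illustration only).
random.seed(0)
text_emb = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in text_emb]
shuffled = text_emb[1:] + text_emb[:1]  # each image paired with the wrong report

loss_aligned = image_text_matching_loss(aligned, text_emb)
loss_shuffled = image_text_matching_loss(shuffled, text_emb)
print(f"aligned pairs: {loss_aligned:.4f}  mismatched pairs: {loss_shuffled:.4f}")
```

The loss is low when each image embedding lies closest to its own report and high when pairs are mismatched, which is the signal that trains the feature extractor. After pre-training, only the image branch would be kept and fine-tuned on the small labeled dataset, so no report is needed at inference time.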
