Multimodal Self-Supervised Learning for Medical Image Analysis

Self-supervised learning approaches leverage unlabeled samples to acquire generic knowledge about different concepts, hence allowing for annotation-efficient downstream task learning. In this paper, we propose a novel self-supervised method that leverages multiple imaging modalities. We introduce the multimodal puzzle task, which facilitates rich representation learning from multiple image modalities. The learned representations allow for subsequent fine-tuning on different downstream tasks. To achieve that, we learn a modality-agnostic feature embedding by confusing image modalities at the data level. Combined with the Sinkhorn operator, with which we formulate puzzle solving as permutation matrix inference instead of classification, this allows for efficient solving of multimodal puzzles with varying levels of complexity. In addition, we propose to utilize cross-modal generation techniques for multimodal data augmentation when training self-supervised tasks. In other words, we exploit synthetic images for self-supervised pretraining, rather than for downstream tasks directly, in order to circumvent quality issues associated with synthetic images while improving data efficiency and representation quality. Our experimental results, which assess the gains in downstream performance and data efficiency, show that solving our multimodal puzzles yields better semantic representations than treating each modality independently. Our results also highlight the benefits of exploiting synthetic images for self-supervised pretraining. We showcase our approach on four downstream tasks: brain tumor segmentation and survival days prediction using four MRI modalities, prostate segmentation using two MRI modalities, and liver segmentation using unregistered CT and MRI modalities. We outperform many previous solutions and achieve results competitive with the state of the art.
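To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes 2D images stored as nested lists and modalities that are spatially aligned (which does not hold for the unregistered CT and MRI case). The names `make_multimodal_puzzle` and `sinkhorn`, the grid size, the iteration count, and the temperature are all hypothetical choices for illustration. The first function mixes modalities at the data level by drawing each tile from a randomly chosen modality before shuffling. The second applies plain Sinkhorn normalization, which alternately rescales rows and columns of a positive matrix so that it approaches a doubly stochastic matrix, a soft stand-in for a permutation matrix. In a real model the scores would come from a network and the operator would be implemented in a differentiable framework.

```python
import math
import random


def make_multimodal_puzzle(modalities, grid, rng):
    """Cut a grid of tiles, each taken from a randomly chosen modality, then shuffle.

    modalities: list of equally sized 2D images (lists of rows), assumed aligned.
    Returns (tiles, perm): tiles[i] is the tile shown at slot i, and perm[i]
    is the original grid position of that tile (the target for the solver).
    """
    height, width = len(modalities[0]), len(modalities[0][0])
    th, tw = height // grid, width // grid
    tiles = []
    for r in range(grid):
        for c in range(grid):
            source = rng.choice(modalities)  # mix modalities at the data level
            rows = source[r * th:(r + 1) * th]
            tiles.append([row[c * tw:(c + 1) * tw] for row in rows])
    perm = list(range(grid * grid))
    rng.shuffle(perm)
    return [tiles[p] for p in perm], perm


def sinkhorn(scores, n_iters=20, temperature=1.0):
    """Alternately normalise rows and columns of exp(scores / temperature)."""
    n = len(scores)
    # Subtracting the global maximum only rescales the matrix, which the
    # normalisation removes, and keeps the exponentials from overflowing.
    top = max(max(row) for row in scores)
    mat = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    for _ in range(n_iters):
        mat = [[v / sum(row) for v in row] for row in mat]
        col_sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        mat = [[mat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return mat


# Toy usage: two fake 4x4 "modalities" and random scores in place of a network.
rng = random.Random(0)
modality_a = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
modality_b = [[-v for v in row] for row in modality_a]
tiles, perm = make_multimodal_puzzle([modality_a, modality_b], grid=2, rng=rng)
scores = [[rng.random() for _ in range(4)] for _ in range(4)]
soft_perm = sinkhorn(scores)
```

Framing the target as a permutation matrix rather than a class label means the output size grows with the square of the number of tiles instead of with the number of possible orderings, which is one way to read the abstract's claim about handling puzzles of varying complexity.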
