Multimodal Representation Learning via Maximization of Local Mutual Information

We propose and demonstrate a representation learning approach by maximizing the mutual information between local features of images and text. The goal of this approach is to learn useful image representations by taking advantage of the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information. We make use of recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information is typically a lower bound on the global mutual information. Our experimental results in the downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning. Our code is available at: https://github.com/RayRuizhiLiao/mutual_info_img_txt.

[1]  Polina Golland,et al.  Deep Learning to Quantify Pulmonary Edema in Chest Radiographs , 2020, ArXiv.

[2]  Roger G. Mark,et al.  MIMIC-CXR: A large publicly available database of labeled chest radiographs , 2019, ArXiv.

[3]  William J. McGill Multivariate information transmission , 1954, Trans. IRE Prof. Group Inf. Theory.

[4]  Aaron C. Courville,et al.  MINE: Mutual Information Neural Estimation , 2018, ArXiv.

[5]  Yuan Xue,et al.  Improved Disease Classification in Chest X-Rays with Transferred Features from Report Generation , 2019, IPMI.

[6]  Tanveer F. Syeda-Mahmood,et al.  Bimodal Network Architectures for Automatic Generation of Image Annotation from Text , 2018, MICCAI.

[7]  Steven Horng,et al.  MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports , 2019, Scientific Data.

[8]  Thomas Wolf,et al.  HuggingFace's Transformers: State-of-the-art Natural Language Processing , 2019, ArXiv.

[9]  Jacob Andreas,et al.  Joint Modeling of Chest Radiographs and Radiology Reports for Pulmonary Edema Assessment , 2020, MICCAI.

[10]  Oriol Vinyals,et al.  Representation Learning with Contrastive Predictive Coding , 2018, ArXiv.

[11]  Pascal Vincent,et al.  Disentangling Factors of Variation for Facial Expression Recognition , 2012, ECCV.

[12]  Wei-Hung Weng,et al.  Publicly Available Clinical BERT Embeddings , 2019, Proceedings of the 2nd Clinical Natural Language Processing Workshop.

[13]  P. Golland,et al.  DEMI: Discriminative Estimator of Mutual Information , 2020, ArXiv.

[14]  Yoshua Bengio,et al.  Learning deep representations by mutual information estimation and maximization , 2018, ICLR.

[15]  Stefano Ermon,et al.  Understanding the Limitations of Variational Mutual Information Estimators , 2020, ICLR.

[16]  Pieter Abbeel,et al.  InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets , 2016, NIPS.

[17]  Armand Joulin,et al.  Unsupervised Learning by Predicting Noise , 2017, ICML.

[18]  Yifan Yu,et al.  CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison , 2019, AAAI.

[19]  Clement J. McDonald,et al.  Preparing a collection of radiology examinations for distribution and retrieval , 2015, J. Am. Medical Informatics Assoc..

[20]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  James R. Glass,et al.  Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input , 2018, ECCV.

[22]  William Wells,et al.  Semi-supervised Learning for Quantification of Pulmonary Edema in Chest X-Ray Images , 2019, ArXiv.

[23]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[24]  Christopher D. Manning,et al.  Contrastive Learning of Medical Visual Representations from Paired Images and Text , 2020, MLHC.

[25]  Paul A. Viola,et al.  Multi-modal volume registration by maximization of mutual information , 1996, Medical Image Anal..

[26]  Guy Marchal,et al.  Multimodality image registration by maximization of mutual information , 1997, IEEE Transactions on Medical Imaging.

[27]  Ronald M. Summers,et al.  TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-Rays , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[28]  Roger B. Grosse,et al.  Isolating Sources of Disentanglement in Variational Autoencoders , 2018, NeurIPS.

[29]  Tanveer F. Syeda-Mahmood,et al.  A Cross-Modality Neural Network Transform for Semi-automatic Medical Image Annotation , 2016, MICCAI.

[30]  Polina Golland,et al.  Pulmonary Edema Severity Estimation in Chest Radiographs Using Deep Learning , 2019 .

[31]  Iryna Gurevych,et al.  Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , 2019, EMNLP.