Learning Multimodal VAEs through Mutual Supervision

Multimodal variational autoencoders (VAEs) seek to model the joint distribution over heterogeneous data (e.g. vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the Mutually supErvised Multimodal VAE (MEME), that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing—something that most existing approaches either cannot handle or can only do so to a limited extent. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes on the MNIST-SVHN (image–image) and CUB (image–text) datasets. We also contrast the quality of the representations learnt by mutual supervision against standard approaches and observe interesting trends in MEME's ability to capture relatedness between data.
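To make the contrast with explicit posterior fusion concrete, below is a minimal illustrative sketch, in PyTorch, of the mutual-supervision idea: each modality's encoder acts as a learned prior over the other's latent code, so the posteriors supervise one another through cross-KL terms rather than being fused via a product or mixture of experts. This is a hedged sketch under assumed Gaussian encoders and simple reconstruction losses, not the paper's exact MEME objective; all module and function names (`GaussianCoder`, `mutual_supervision_loss`, `dec_x`, `dec_y`) are hypothetical.

```python
# Illustrative sketch only: NOT the paper's exact MEME objective.
# Two modality-specific VAEs where each posterior is regularised towards the
# distribution the other modality's encoder assigns to the shared latent.
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence


class GaussianCoder(nn.Module):
    """Maps an input vector to a diagonal Gaussian over the latent (hypothetical helper)."""
    def __init__(self, in_dim, z_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))

    def forward(self, v):
        mu, logvar = self.net(v).chunk(2, dim=-1)
        return Normal(mu, (0.5 * logvar).exp())


def mutual_supervision_loss(x, y, enc_x, enc_y, dec_x, dec_y):
    """Symmetric ELBO-like loss for a fully observed (x, y) pair.

    Each modality is reconstructed from its own posterior sample, while the two
    posteriors supervise each other via cross-KL terms (mutual supervision),
    avoiding any explicit product/mixture fusion of the recognition model.
    """
    qx, qy = enc_x(x), enc_y(y)            # per-modality posteriors over shared z
    zx, zy = qx.rsample(), qy.rsample()    # reparameterised samples
    recon = F.mse_loss(dec_x(zx), x) + F.mse_loss(dec_y(zy), y)
    cross_kl = (kl_divergence(qx, qy).sum(-1).mean()
                + kl_divergence(qy, qx).sum(-1).mean())
    return recon + cross_kl
```

When one modality is missing for a training example, the corresponding reconstruction and cross-KL terms can simply be dropped (or replaced by a KL against an unconditional prior), which is one way to read the abstract's claim that the formulation extends naturally to partially-observed data.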
