Joint Multimodal Learning with Deep Generative Models

We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Recently, some studies handle multiple modalities on deep generative models, such as variational autoencoders (VAEs). However, these models typically assume that modalities are forced to have a conditioned relation, i.e., we can only generate modalities in one direction. To achieve our objective, we should extract a joint representation that captures high-level concepts among all modalities and through which we can exchange them bi-directionally. As described herein, we propose a joint multimodal variational autoencoder (JMVAE), in which all modalities are independently conditioned on joint representation. In other words, it models a joint distribution of modalities. Furthermore, to be able to generate missing modalities from the remaining modalities properly, we develop an additional method, JMVAE-kl, that is trained by reducing the divergence between JMVAE's encoder and prepared networks of respective modalities. Our experiments show that our proposed method can obtain appropriate joint representation from multiple modalities and that it can generate and reconstruct them more properly than conventional VAEs. We further demonstrate that JMVAE can generate multiple modalities bi-directionally.

[1]  Ruslan Salakhutdinov,et al.  Importance Weighted Autoencoders , 2015, ICLR.

[2]  Ambedkar Dukkipati,et al.  Variational methods for conditional multimodal deep learning , 2016, 2017 International Joint Conference on Neural Networks (IJCNN).

[3]  Max Welling,et al.  The Variational Fair Autoencoder , 2015, ICLR.

[4]  R. Zemel,et al.  THE VARIATIONAL FAIR AUTO ENCODER , 2015 .

[5]  Ambedkar Dukkipati,et al.  Variational methods for Conditional Multimodal Learning: Generating Human Faces from Attributes , 2016, ArXiv.

[6]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[7]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[8]  Xinyun Chen Under Review as a Conference Paper at Iclr 2017 Delving into Transferable Adversarial Ex- Amples and Black-box Attacks , 2016 .

[9]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[10]  Honglak Lee,et al.  Improved Multimodal Deep Learning with Variation of Information , 2014, NIPS.

[11]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[12]  Ole Winther,et al.  Ladder Variational Autoencoders , 2016, NIPS.

[13]  Honglak Lee,et al.  Learning Structured Output Representation using Deep Conditional Generative Models , 2015, NIPS.

[14]  Ole Winther,et al.  Autoencoding beyond pixels using a learned similarity metric , 2015, ICML.

[15]  Geoffrey E. Hinton,et al.  Deep Boltzmann Machines , 2009, AISTATS.

[16]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[17]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[18]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[19]  Ruslan Salakhutdinov,et al.  Generating Images from Captions with Attention , 2015, ICLR.

[20]  Xiaogang Wang,et al.  Deep Learning Face Attributes in the Wild , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[22]  John Salvatier,et al.  Theano: A Python framework for fast computation of mathematical expressions , 2016, ArXiv.

[23]  Honglak Lee,et al.  Attribute2Image: Conditional Image Generation from Visual Attributes , 2015, ECCV.

[24]  Colin Raffel,et al.  Lasagne: First release. , 2015 .