Variational Hetero-Encoder Randomized Generative Adversarial Networks for Joint Image-Text Modeling

For bidirectional joint image-text modeling, we develop a variational hetero-encoder (VHE) randomized generative adversarial network (GAN) that integrates a probabilistic text decoder, a probabilistic image encoder, and a GAN into a coherent end-to-end multi-modality learning framework. VHE randomized GAN (VHE-GAN) encodes an image to decode its associated text, and feeds the variational posterior into the GAN image generator as its source of randomness. Plugging three off-the-shelf modules (a deep topic model, a ladder-structured image encoder, and StackGAN++) into VHE-GAN already achieves competitive performance. This further motivates the development of VHE-raster-scan-GAN, which generates photo-realistic images not only in a multi-scale low-to-high-resolution manner, but also in a hierarchical-semantic coarse-to-fine fashion. By capturing and relating hierarchical semantic and visual concepts with end-to-end training, VHE-raster-scan-GAN achieves state-of-the-art performance in a wide variety of image-text multi-modality learning and generation tasks. PyTorch code is provided.
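The data flow described in the abstract (image to variational posterior, posterior sample to both the text decoder and the image generator) can be illustrated with a toy sketch. This is not the released PyTorch code: every function name, the Weibull form of the posterior, the Poisson-rate text decoder, and the arithmetic stand-ins for the encoder and generator networks are assumptions chosen only to keep the example self-contained.

```python
import math
import random


def encode_image(image, num_latent=4):
    """Hypothetical image encoder (assumption): maps pixels to the
    (shape, scale) parameters of a Weibull posterior per latent dimension.
    A real encoder would be a neural network."""
    mean = sum(image) / len(image)
    shapes = [1.0 + mean for _ in range(num_latent)]
    scales = [0.5 + mean * (i + 1) for i in range(num_latent)]
    return shapes, scales


def sample_posterior(shapes, scales, rng):
    """Reparameterized Weibull draw: z = scale * (-log(1 - u))**(1 / shape),
    with u ~ Uniform[0, 1). The sample z is non-negative."""
    return [
        scale * (-math.log(1.0 - rng.random())) ** (1.0 / shape)
        for shape, scale in zip(shapes, scales)
    ]


def decode_text(z, topics):
    """Assumed topic-model style text decoder: one Poisson rate per
    vocabulary word, rate_v = sum_k topics[v][k] * z[k]."""
    return [sum(t * zk for t, zk in zip(row, z)) for row in topics]


def generate_image(z, num_pixels=8):
    """Hypothetical generator stand-in (assumption): a fixed sigmoid map
    from the posterior sample to pixel intensities. In VHE-GAN this role
    is played by a GAN generator whose noise input is the sample z."""
    return [
        1.0 / (1.0 + math.exp(-(z[i % len(z)] - 1.0)))
        for i in range(num_pixels)
    ]


# Usage: one sample z drives both the text decoder and the image generator.
rng = random.Random(0)
image = [0.2, 0.4, 0.6, 0.8]
topics = [[0.1, 0.0, 0.3, 0.2], [0.5, 0.1, 0.0, 0.0], [0.0, 0.2, 0.2, 0.4]]
shapes, scales = encode_image(image)
z = sample_posterior(shapes, scales, rng)
word_rates = decode_text(z, topics)
fake_image = generate_image(z)
```

The point of the sketch is the sharing of `z`: the same posterior sample is scored against the text and used as the generator's randomness, which is what ties the two modalities together in the framework described above.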
