Paper information - Variational Hetero-Encoder Randomized Generative Adversarial Networks for Joint Image-Text Modeling

For bidirectional joint image-text modeling, we develop a variational hetero-encoder (VHE) randomized generative adversarial network (GAN) that integrates a probabilistic text decoder, a probabilistic image encoder, and a GAN into a coherent end-to-end multi-modality learning framework. The VHE randomized GAN (VHE-GAN) encodes an image to decode its associated text, and feeds the variational posterior as the source of randomness into the GAN image generator. We plug three off-the-shelf modules, namely a deep topic model, a ladder-structured image encoder, and StackGAN++, into VHE-GAN, and this plug-in configuration already achieves competitive performance. This further motivates the development of VHE-raster-scan-GAN, which generates photo-realistic images not only in a multi-scale low-to-high-resolution manner, but also in a hierarchical-semantic coarse-to-fine fashion. By capturing and relating hierarchical semantic and visual concepts with end-to-end training, VHE-raster-scan-GAN achieves state-of-the-art performance in a wide variety of image-text multi-modality learning and generation tasks. PyTorch code is provided.

Hao Zhang | Mingyuan Zhou | Bo Chen | Zhengjue Wang | Long Tian
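
The data flow described in the abstract (an image is encoded into a variational posterior, a sample from that posterior decodes the text, and the same sample serves as the generator's source of randomness) can be illustrated with a toy, single-layer sketch. This is not the authors' implementation: every function name below is hypothetical, the "networks" are fixed arithmetic stand-ins rather than trained modules, and the Weibull reparameterization and Poisson-rate text decoder are assumptions chosen for illustration, since the abstract does not specify the posterior family or the likelihood.

```python
import math
import random


def encode_image(image, num_topics):
    """Hypothetical image encoder: maps pixels to positive Weibull
    shape (k) and scale (lam) parameters, one pair per topic."""
    mean = sum(image) / len(image)
    k = [1.0 + abs(mean) for _ in range(num_topics)]
    lam = [0.5 + (t + 1) * abs(mean) for t in range(num_topics)]
    return k, lam


def sample_posterior(k, lam, rng):
    """Reparameterized Weibull draw (an assumption of this sketch):
    theta = lam * (-ln(1 - u)) ** (1 / k), with u ~ Uniform[0, 1)."""
    return [l * (-math.log(1.0 - rng.random())) ** (1.0 / s)
            for s, l in zip(k, lam)]


def decode_text(theta, topics):
    """Toy text decoder: per-word rates are topics @ theta, where
    topics[v][t] is the weight of word v in topic t."""
    return [sum(w * th for w, th in zip(row, theta)) for row in topics]


def generate_image(theta, image_size):
    """Toy generator: a fixed function of theta standing in for a GAN
    generator; the posterior sample is its only source of randomness."""
    return [math.tanh(sum(th * math.sin(i + t + 1)
                          for t, th in enumerate(theta)))
            for i in range(image_size)]


def vhe_gan_forward(image, topics, image_size, rng):
    """One forward pass: image -> posterior sample -> (text rates, fake image)."""
    k, lam = encode_image(image, len(topics[0]))
    theta = sample_posterior(k, lam, rng)
    word_rates = decode_text(theta, topics)
    fake_image = generate_image(theta, image_size)
    return theta, word_rates, fake_image
```

In this sketch the text decoder and the image generator share the single latent sample `theta`, which is the coupling the abstract describes; in a trained model, a discriminator loss on `fake_image` and a text likelihood on `word_rates` would both backpropagate into the encoder.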
