StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks

Synthesizing high-quality images from text descriptions is a challenging problem in computer vision and has many practical applications. Samples generated by existing textto- image approaches can roughly reflect the meaning of the given descriptions, but they fail to contain necessary details and vivid object parts. In this paper, we propose Stacked Generative Adversarial Networks (StackGAN) to generate 256.256 photo-realistic images conditioned on text descriptions. We decompose the hard problem into more manageable sub-problems through a sketch-refinement process. The Stage-I GAN sketches the primitive shape and colors of the object based on the given text description, yielding Stage-I low-resolution images. The Stage-II GAN takes Stage-I results and text descriptions as inputs, and generates high-resolution images with photo-realistic details. It is able to rectify defects in Stage-I results and add compelling details with the refinement process. To improve the diversity of the synthesized images and stabilize the training of the conditional-GAN, we introduce a novel Conditioning Augmentation technique that encourages smoothness in the latent conditioning manifold. Extensive experiments and comparisons with state-of-the-arts on benchmark datasets demonstrate that the proposed method achieves significant improvements on generating photo-realistic images conditioned on text descriptions.

[1]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[2]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[3]  Andrew L. Maas Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[4]  Simon Osindero,et al.  Conditional Generative Adversarial Nets , 2014, ArXiv.

[5]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[6]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[7]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[8]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[9]  Scott E. Reed,et al.  Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis , 2015, NIPS.

[10]  Rob Fergus,et al.  Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks , 2015, NIPS.

[11]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[12]  Jon Gauthier Conditional generative adversarial nets for convolutional face generation , 2015 .

[13]  Yuting Zhang,et al.  Deep Visual Analogy-Making , 2015, NIPS.

[14]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[15]  Tianqi Chen,et al.  Empirical Evaluation of Rectified Activations in Convolutional Network , 2015, ArXiv.

[16]  Thomas Brox,et al.  Learning to generate chairs with convolutional neural networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Ruslan Salakhutdinov,et al.  Generating Images from Captions with Attention , 2015, ICLR.

[18]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[19]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Carl Doersch,et al.  Tutorial on Variational Autoencoders , 2016, ArXiv.

[21]  Yann LeCun,et al.  Energy-based Generative Adversarial Network , 2016, ICLR.

[22]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Honglak Lee,et al.  Attribute2Image: Conditional Image Generation from Visual Attributes , 2015, ECCV.

[24]  Pieter Abbeel,et al.  InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets , 2016, NIPS.

[25]  Koray Kavukcuoglu,et al.  Pixel Recurrent Neural Networks , 2016, ICML.

[26]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[27]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[28]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[29]  Bernt Schiele,et al.  Learning Deep Representations of Fine-Grained Visual Descriptions , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Abhinav Gupta,et al.  Generative Image Modeling Using Style and Structure Adversarial Networks , 2016, ECCV.

[31]  Li Fei-Fei,et al.  Perceptual Losses for Real-Time Style Transfer and Super-Resolution , 2016, ECCV.

[32]  Bernt Schiele,et al.  Learning What and Where to Draw , 2016, NIPS.

[33]  Ole Winther,et al.  Autoencoding beyond pixels using a learned similarity metric , 2015, ICML.

[34]  Alexei A. Efros,et al.  Generative Visual Manipulation on the Natural Image Manifold , 2016, ECCV.

[35]  Lior Wolf,et al.  Unsupervised Cross-Domain Image Generation , 2016, ICLR.

[36]  Yoshua Bengio,et al.  Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Andrew Brock,et al.  Neural Photo Editing with Introspective Adversarial Networks , 2016, ICLR.

[38]  Lucas Theis,et al.  Amortised MAP Inference for Image Super-resolution , 2016, ICLR.

[39]  David Pfau,et al.  Unrolled Generative Adversarial Networks , 2016, ICLR.

[40]  Yann LeCun,et al.  Energy-based Generative Adversarial Networks , 2016, ICLR.

[41]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Léon Bottou,et al.  Towards Principled Methods for Training Generative Adversarial Networks , 2017, ICLR.

[43]  John E. Hopcroft,et al.  Stacked Generative Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Nando de Freitas,et al.  Generating Interpretable Images with Controllable Structure , 2017 .

[45]  Christian Ledig,et al.  Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Jonathon Shlens,et al.  Conditional Image Synthesis with Auxiliary Classifier GANs , 2016, ICML.

[47]  Yoshua Bengio,et al.  Mode Regularized Generative Adversarial Networks , 2016, ICLR.