AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. With a novel attentional generative network, the AttnGAN can synthesize fine-grained details at different sub-regions of the image by paying attentions to the relevant words in the natural language description. In addition, a deep attentional multimodal similarity model is proposed to compute a fine-grained image-text matching loss for training the generator. The proposed AttnGAN significantly outperforms the previous state of the art, boosting the best reported inception score by 14.14% on the CUB dataset and 170.25% on the more challenging COCO dataset. A detailed analysis is also performed by visualizing the attention layers of the AttnGAN. It for the first time shows that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image.

[1]  Biing-Hwang Juang,et al.  Minimum classification error rate methods for speech recognition , 1997, IEEE Trans. Speech Audio Process..

[2]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[3]  Wu Chou,et al.  Discriminative learning in sequential pattern recognition , 2008, IEEE Signal Processing Magazine.

[4]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[5]  Larry P. Heck,et al.  Learning deep structured semantic models for web search using clickthrough data , 2013, CIKM.

[6]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[7]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[8]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[9]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Rob Fergus,et al.  Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks , 2015, NIPS.

[11]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[12]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[13]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[14]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[15]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[16]  Ruslan Salakhutdinov,et al.  Generating Images from Captions with Attention , 2015, ICLR.

[17]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[18]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[21]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[22]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[23]  Bernt Schiele,et al.  Learning Deep Representations of Fine-Grained Visual Descriptions , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Bernt Schiele,et al.  Learning What and Where to Draw , 2016, NIPS.

[25]  Yoshua Bengio,et al.  Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[27]  Sergio Gomez Colmenarejo,et al.  Parallel Multiscale Autoregressive Density Estimation , 2017, ICML.

[28]  Zhe Gan,et al.  Semantic Compositional Networks for Visual Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Lin Yang,et al.  MDNet: A Semantically and Visually Interpretable Medical Image Diagnosis Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Christian Ledig,et al.  Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Dimitris N. Metaxas,et al.  StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Xi Peng,et al.  A Generative Adversarial Approach for Zero-Shot Learning from Noisy Texts , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Han Zhang,et al.  Improving GANs Using Optimal Transport , 2018, ICLR.

[35]  Lin Yang,et al.  Photographic Text-to-Image Synthesis with a Hierarchically-Nested Adversarial Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Lin Yang,et al.  Translating and Segmenting Multimodal Medical Volumes with Cycle- and Shape-Consistency Generative Adversarial Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Hang Zhang,et al.  Multi-style Generative Network for Real-time Transfer , 2017, ECCV Workshops.

[38]  Xiaogang Wang,et al.  Context Encoding for Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[39]  Vishal M. Patel,et al.  Densely Connected Pyramid Dehazing Network , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Xiaogang Wang,et al.  StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  Vishal M. Patel,et al.  Image De-Raining Using a Conditional Generative Adversarial Network , 2017, IEEE Transactions on Circuits and Systems for Video Technology.