CPGAN: Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis

Typical methods for text-to-image synthesis seek to design an effective generative architecture that models the text-to-image mapping directly, which is fairly arduous due to the cross-modality translation involved. In this paper we circumvent this problem by thoroughly parsing the content of both the input text and the synthesized image to model text-to-image consistency at the semantic level. In particular, we design a memory structure to parse the textual content by exploring the semantic correspondence between each word in the vocabulary and its various visual contexts across relevant images during text encoding. Meanwhile, the synthesized image is parsed to learn its semantics in an object-aware manner. Moreover, we customize a conditional discriminator to model the fine-grained correlations between words and image sub-regions and thereby enforce text-image semantic alignment. Extensive experiments on the COCO dataset show that our model advances the state of the art significantly (from 35.69 to 52.73 in Inception Score).
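To make the memory-structure idea concrete, the sketch below shows one plausible way to realize it in PyTorch: each vocabulary word indexes a small bank of visual context features (assumed to be gathered from relevant training images), and the word embedding attends over that bank to produce a visually grounded word representation during text encoding. This is a minimal illustrative sketch, not the authors' implementation; the module name, the number of contexts per word, and the feature dimensions are assumptions.

```python
# Minimal sketch (assumed design, not the authors' code) of memory-based text parsing:
# each word attends over a bank of visual context vectors associated with it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryAugmentedWordEncoder(nn.Module):
    def __init__(self, vocab_size, txt_dim=256, vis_dim=256, contexts_per_word=8):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, txt_dim)
        # Memory bank: K visual context vectors per word (assumed to be
        # precomputed from images whose captions contain that word).
        self.visual_memory = nn.Parameter(
            torch.randn(vocab_size, contexts_per_word, vis_dim))
        self.query_proj = nn.Linear(txt_dim, vis_dim)
        self.fuse = nn.Linear(txt_dim + vis_dim, txt_dim)

    def forward(self, word_ids):                       # word_ids: (B, T)
        e = self.word_emb(word_ids)                    # (B, T, txt_dim)
        mem = self.visual_memory[word_ids]             # (B, T, K, vis_dim)
        q = self.query_proj(e).unsqueeze(2)            # (B, T, 1, vis_dim)
        attn = F.softmax((q * mem).sum(-1), dim=-1)    # attention over K contexts
        vis = (attn.unsqueeze(-1) * mem).sum(2)        # (B, T, vis_dim)
        return self.fuse(torch.cat([e, vis], dim=-1))  # visually grounded words

# Usage (illustrative): encoder = MemoryAugmentedWordEncoder(vocab_size=5000)
# words = torch.randint(0, 5000, (4, 12)); out = encoder(words)  # (4, 12, 256)
```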
