DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis

Text-to-image synthesis refers to generating an image from a given text description, the key goals being photo-realism and semantic consistency. Previous methods usually generate an initial image from the sentence embedding and then refine it with fine-grained word embeddings. Despite significant progress, they often ignore the ‘aspect’ information contained in the text, i.e., a phrase of several words (rather than a single word) that depicts ‘a particular part or feature of something’, such as red eyes, even though such information is highly helpful for synthesizing image details. How to better utilize aspect information in text-to-image synthesis remains an open challenge. To address this problem, in this paper, we propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively at multiple granularities, including sentence-level, word-level, and aspect-level. Moreover, inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are employed alternately. AGR utilizes word-level embeddings to globally enhance the previously generated image, while ALR dynamically employs aspect-level embeddings to refine image details from a local perspective. Finally, a corresponding matching loss function is designed to ensure text-image semantic consistency at each level. Extensive experiments on two well-studied, publicly available datasets (CUB-200 and COCO) demonstrate the superiority and rationality of our method.
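
To make the alternating refinement concrete, the following is a minimal PyTorch sketch of how an Aspect-aware Dynamic Re-drawer might be structured. Only the module names (ADR, AGR, ALR) come from the abstract; every internal detail, including the attention form in AGR, the channel gating in ALR, the dimensions, and the weight sharing across rounds, is an illustrative assumption rather than the paper's exact architecture.

```python
# Illustrative sketch only: module names follow the abstract, internals are assumed.
import torch
import torch.nn as nn


class AGR(nn.Module):
    """Attended Global Refinement: let every image location attend over word embeddings."""

    def __init__(self, img_dim, word_dim):
        super().__init__()
        self.proj = nn.Linear(word_dim, img_dim)            # map words into image feature space
        self.fuse = nn.Conv2d(img_dim * 2, img_dim, 3, padding=1)

    def forward(self, img_feat, word_emb):
        # img_feat: (B, C, H, W); word_emb: (B, T, D)
        B, C, H, W = img_feat.shape
        words = self.proj(word_emb)                         # (B, T, C)
        queries = img_feat.flatten(2).transpose(1, 2)       # (B, HW, C)
        attn = torch.softmax(queries @ words.transpose(1, 2), dim=-1)  # (B, HW, T)
        context = (attn @ words).transpose(1, 2).reshape(B, C, H, W)   # word context per location
        return self.fuse(torch.cat([img_feat, context], dim=1))


class ALR(nn.Module):
    """Aspect-aware Local Refinement: refine local details conditioned on one aspect."""

    def __init__(self, img_dim, aspect_dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(aspect_dim, img_dim), nn.Sigmoid())
        self.refine = nn.Conv2d(img_dim, img_dim, 3, padding=1)

    def forward(self, img_feat, aspect_emb):
        # aspect_emb: (B, D); gate channels by aspect relevance, then refine residually
        g = self.gate(aspect_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return img_feat + self.refine(img_feat * g)


class ADR(nn.Module):
    """Aspect-aware Dynamic Re-drawer: alternate AGR and ALR, one local pass per aspect."""

    def __init__(self, img_dim, word_dim, aspect_dim):
        super().__init__()
        self.agr = AGR(img_dim, word_dim)       # assumption: weights shared across rounds
        self.alr = ALR(img_dim, aspect_dim)

    def forward(self, img_feat, word_emb, aspect_embs):
        # aspect_embs: iterable of (B, D) tensors, one per aspect found in the text
        for aspect in aspect_embs:
            img_feat = self.agr(img_feat, word_emb)  # global refinement with all words
            img_feat = self.alr(img_feat, aspect)    # local refinement with one aspect
        return img_feat
```

Each refinement round here pairs one global word-attention pass (AGR) with one local pass conditioned on a single aspect embedding (ALR), mirroring the alternation described above. For the matching loss enforcing text-image consistency at each granularity, one plausible hedged form is a contrastive loss applied per level; the pooling, similarity measure, and temperature below are assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a multi-level matching loss; inputs are assumed to be pooled (B, D)
# features, with matched image-text pairs sharing a row index in the batch.
import torch
import torch.nn.functional as F


def multilevel_matching_loss(img_vec, sent_vec, word_vec, aspect_vec, tau=0.1):
    """Contrastive image-text matching at sentence, word, and aspect level."""
    def contrastive(a, b):
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / tau                            # (B, B) pairwise similarities
        labels = torch.arange(a.size(0), device=a.device)   # diagonal = matched pairs
        return F.cross_entropy(logits, labels)

    return sum(contrastive(img_vec, t) for t in (sent_vec, word_vec, aspect_vec))
```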
