Semantic layout aware generative adversarial network for text-to-image generation

Text-to-image (T2I) generation methods aim to synthesize a high-quality image that is semantically consistent with a given text description. Previous T2I generative adversarial networks generally first create a low-resolution image with rough shapes and colors, and then refine it into a high-resolution image. These stacked architectures still suffer from two main problems. (1) The final images depend heavily on the quality of the initial image; if the initial image is not well initialized, the resulting image looks like a simple combination of visual features from several image scales. (2) The cross-modal fusion methods widely adopted by previous works are limited in how they fuse text and image features. In this paper, we propose a novel generation model that introduces a one-stage backbone to directly generate high-quality images without multiple generators, together with a novel semantic layout deep fusion network to sufficiently fuse text features and image features. Experiments on the challenging CUB and COCO-Stuff datasets demonstrate the ability of our model to generate images that are both semantically consistent with the input text description and visually faithful.
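
As a rough illustration of the kind of text-image deep fusion described above, the sketch below conditions image feature maps on a sentence embedding through predicted channel-wise scale and shift. This is a minimal PyTorch sketch under our own assumptions, not the authors' implementation; the class name, dimensions, and modulation scheme are hypothetical.

```python
# Hypothetical sketch of a text-conditioned fusion block (illustrative only).
import torch
import torch.nn as nn

class TextImageFusionBlock(nn.Module):
    """Fuses a sentence embedding into image feature maps by predicting
    channel-wise scale and shift parameters from the text (affine modulation)."""

    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        self.gamma = nn.Linear(text_dim, channels)  # per-channel scale from text
        self.beta = nn.Linear(text_dim, channels)   # per-channel shift from text

    def forward(self, img_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); text_emb: (B, text_dim)
        gamma = self.gamma(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.beta(text_emb).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        return img_feat * (1 + gamma) + beta

# Usage: modulate a 256-channel feature map with a 256-d sentence embedding.
block = TextImageFusionBlock(text_dim=256, channels=256)
fused = block(torch.randn(2, 256, 16, 16), torch.randn(2, 256))
```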
