DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis

Synthesizing high-quality, realistic images from text descriptions is a challenging task. Existing text-to-image Generative Adversarial Networks generally employ a stacked architecture as the backbone, yet three flaws remain. First, the stacked architecture introduces entanglements between generators of different image scales. Second, existing studies tend to apply extra networks and keep them fixed during adversarial learning to enforce text-image semantic consistency, which limits the supervision capability of these networks. Third, the cross-modal attention-based text-image fusion widely adopted by previous works can be applied only at a few specific image scales because of its computational cost. To address these issues, we propose a simpler but more effective Deep Fusion Generative Adversarial Networks (DF-GAN). Specifically, we propose: (i) a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators, (ii) a novel Target-Aware Discriminator composed of a Matching-Aware Gradient Penalty and a One-Way Output, which enhances text-image semantic consistency without introducing extra networks, and (iii) a novel deep text-image fusion block, which deepens the fusion process to fuse text and visual features fully. Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic, text-matching images, and achieves better performance on widely used datasets. Code is available at https://github.com/tobran/DF-GAN.
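
To make item (iii) concrete, the following is a minimal sketch of one way text can be fused into visual features repeatedly: a channel-wise scale and shift predicted from a sentence vector, stacked with a nonlinearity in between. The abstract does not specify the internals of the fusion block, so the affine form, the ReLU, and every name here (`affine_fuse`, `deep_fusion`, `make_params`) are illustrative assumptions rather than the authors' implementation; the linked repository is the reference.

```python
import random

def linear(weights, bias, vec):
    """Dense layer: weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def affine_fuse(feature_map, sentence_vec, params):
    """Scale and shift each channel with values predicted from the text.

    Assumption: a FiLM-style channel-wise affine transform stands in for
    the fusion step; feature_map is a nested list shaped [C][H][W].
    """
    gamma = linear(params["w_gamma"], params["b_gamma"], sentence_vec)
    beta = linear(params["w_beta"], params["b_beta"], sentence_vec)
    return [[[g * px + b for px in row] for row in channel]
            for channel, g, b in zip(feature_map, gamma, beta)]

def relu(feature_map):
    """Element-wise ReLU over a [C][H][W] nested list."""
    return [[[max(0.0, px) for px in row] for row in channel]
            for channel in feature_map]

def make_params(channels, text_dim, rng):
    """Hypothetical random initialisation; gamma starts near 1, beta near 0."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(text_dim)]
                for _ in range(channels)]
    return {"w_gamma": mat(), "b_gamma": [1.0] * channels,
            "w_beta": mat(), "b_beta": [0.0] * channels}

def deep_fusion(feature_map, sentence_vec, layers):
    """Apply the text-conditioned transform once per layer ("deepening")."""
    for params in layers:
        feature_map = relu(affine_fuse(feature_map, sentence_vec, params))
    return feature_map

rng = random.Random(0)
channels, height, width, text_dim = 3, 2, 2, 4
features = [[[rng.uniform(-1, 1) for _ in range(width)]
             for _ in range(height)] for _ in range(channels)]
sentence = [rng.uniform(-1, 1) for _ in range(text_dim)]
layers = [make_params(channels, text_dim, rng) for _ in range(2)]
fused = deep_fusion(features, sentence, layers)
```

Each layer reads the same sentence vector, so the text conditions the features at every depth rather than at a single scale; the cost per layer is two small dense layers plus one multiply-add per pixel, which is why such a transform can be applied at any resolution.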
