Exploring Global and Local Linguistic Representations for Text-to-Image Synthesis

Text-to-image synthesis aims to generate photographic images conditioned on given textual descriptions. This challenging task has recently attracted considerable attention from the multimedia community due to its potential applications. Most state-of-the-art approaches are built on generative adversarial network (GAN) models and synthesize images conditioned on a global linguistic representation. However, the sparsity of the global representation makes GAN training difficult and leaves the generated images short of fine-grained detail. To address this problem, we propose a cross-modal global and local linguistic representation-based generative adversarial network (CGL-GAN), which incorporates the local linguistic representation into the GAN. In our CGL-GAN, we construct a generator to synthesize the target images and a discriminator to judge whether the generated images conform to the text description. In the discriminator, we model the cross-modal correlation by projecting the high-level and low-level image representations onto the global and local linguistic representations, respectively. We train our CGL-GAN with a hinge loss. We evaluate the proposed CGL-GAN on two publicly available datasets, CUB and MS-COCO. Extensive experiments demonstrate that incorporating fine-grained local linguistic information through cross-modal correlation can greatly improve the performance of text-to-image synthesis, even when generating high-resolution images.
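
To make the described discriminator concrete, below is a minimal PyTorch sketch (not the authors' released code) of a projection-style conditional discriminator that scores an image against both a global sentence embedding and local word embeddings, trained with a hinge loss. All layer sizes, the feature extractors, and the aggregation of word-region scores are illustrative assumptions; only the overall structure (low-level features projected onto word embeddings, high-level features projected onto the sentence embedding, hinge objective) follows the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionDiscriminator(nn.Module):
    def __init__(self, word_dim=256, sent_dim=256, ch=64):
        super().__init__()
        # Low-level feature extractor: keeps spatial resolution,
        # so regions can be matched against individual words.
        self.low = nn.Sequential(
            nn.Conv2d(3, ch, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch * 2, 4, 2, 1), nn.LeakyReLU(0.2),
        )
        # High-level feature extractor: pools to one global image vector.
        self.high = nn.Sequential(
            nn.Conv2d(ch * 2, ch * 4, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(ch * 4, ch * 8, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
        )
        # Projections onto the local (word) and global (sentence) spaces.
        self.proj_local = nn.Conv2d(ch * 2, word_dim, 1)  # per-region features
        self.proj_global = nn.Linear(ch * 8, sent_dim)    # whole-image feature
        self.uncond = nn.Linear(ch * 8, 1)                # unconditional score

    def forward(self, img, words, sent):
        # img: (B, 3, 64, 64)  words: (B, T, word_dim)  sent: (B, sent_dim)
        low = self.low(img)                    # (B, 2ch, 16, 16)
        high = self.high(low).flatten(1)       # (B, 8ch)

        # Local term: inner products between every word and every image
        # region, averaged over words and regions (one simple aggregation).
        regions = self.proj_local(low).flatten(2)          # (B, word_dim, R)
        local = torch.einsum('btd,bdr->btr', words, regions).mean((1, 2))

        # Global term: pooled image feature projected onto the sentence.
        glob = (self.proj_global(high) * sent).sum(1)

        return self.uncond(high).squeeze(1) + glob + local


def d_hinge_loss(real_score, fake_score):
    # Hinge loss for the discriminator: push real scores above +1
    # and fake scores below -1.
    return F.relu(1.0 - real_score).mean() + F.relu(1.0 + fake_score).mean()


def g_hinge_loss(fake_score):
    # Generator maximizes the discriminator's score on fake images.
    return -fake_score.mean()


# Example: score a batch of 64x64 images against a caption encoding.
D = ProjectionDiscriminator()
img = torch.randn(4, 3, 64, 64)
words = torch.randn(4, 18, 256)   # 18 words, 256-d word embeddings
sent = torch.randn(4, 256)        # 256-d sentence embedding
score = D(img, words, sent)       # shape (4,)

Summing an unconditional score with projection terms is the formulation popularized by projection discriminators (Miyato and Koyama, ICLR 2018); CGL-GAN's exact matching and aggregation of word-region correlations may differ from this sketch.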
