Paper Information - Disentangling for Text-to-Image Generation

Disentangling for Text-to-Image Generation

Synthesizing photo-realistic images from text descriptions is a challenging problem. Previous studies have shown remarkable progress in the visual quality of the generated images. In this paper, we consider semantics from the input text descriptions to help render photo-realistic images. However, diverse linguistic expressions pose challenges in extracting consistent semantics even when they depict the same thing. To this end, we propose a novel photo-realistic text-to-image generation model that implicitly disentangles semantics to achieve both high-level semantic consistency and low-level semantic diversity. Specifically, we design (1) a Siamese mechanism in the discriminator to learn consistent high-level semantics, and (2) a visual-semantic embedding strategy based on semantic-conditioned batch normalization to find diverse low-level semantics. Extensive experiments and ablation studies on the CUB and MS-COCO datasets demonstrate the superiority of the proposed method over state-of-the-art methods.
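The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration: the function names, the list-based shapes, the linear projections `w_gamma` and `w_beta`, and the margin-based contrastive loss are hypothetical stand-ins, not the authors' implementation. The first function normalizes each channel over the batch and then lets a sentence embedding shift the scale and bias, which is the general idea behind a text-conditioned batch normalization. The second is a generic contrastive loss of the kind commonly used to train Siamese branches.

```python
import math


def conditional_batch_norm(features, sentence_vecs, w_gamma, w_beta, eps=1e-5):
    """Batch-normalize features, then modulate them with text embeddings.

    features:      list of B feature vectors, each of length C.
    sentence_vecs: list of B sentence embeddings, each of length D.
    w_gamma, w_beta: C x D projection matrices (hypothetical parameters)
                     mapping an embedding to per-channel scale/bias offsets.
    """
    batch = len(features)
    channels = len(features[0])

    # Per-channel mean and (biased) variance over the batch.
    means = [sum(f[c] for f in features) / batch for c in range(channels)]
    variances = [
        sum((f[c] - means[c]) ** 2 for f in features) / batch
        for c in range(channels)
    ]

    out = []
    for f, s in zip(features, sentence_vecs):
        row = []
        for c in range(channels):
            x_hat = (f[c] - means[c]) / math.sqrt(variances[c] + eps)
            # Text-dependent offsets to the usual scale (1) and bias (0).
            d_gamma = sum(w * v for w, v in zip(w_gamma[c], s))
            d_beta = sum(w * v for w, v in zip(w_beta[c], s))
            row.append((1.0 + d_gamma) * x_hat + d_beta)
        out.append(row)
    return out


def contrastive_loss(v1, v2, same, margin=1.0):
    """Generic contrastive loss for one pair of branch outputs.

    same=True  pulls the two vectors together (loss grows with distance).
    same=False pushes them at least `margin` apart.
    """
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


if __name__ == "__main__":
    feats = [[1.0, 0.0], [3.0, 4.0]]
    sents = [[1.0], [0.0]]
    w_g = [[0.5], [0.0]]
    w_b = [[0.0], [1.0]]
    print(conditional_batch_norm(feats, sents, w_g, w_b))
    print(contrastive_loss([0.0, 0.0], [3.0, 4.0], same=True))
```

In this sketch, two captions describing the same image would be fed through two weight-sharing branches and scored with `same=True`, while the per-sample scale and bias offsets let wording differences change low-level features. Whether the paper combines the pieces in exactly this way is not stated in the abstract.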
