Paper Information - Instance Mask Embedding and Attribute-Adaptive Generative Adversarial Network for Text-to-Image Synthesis

Existing image generation models can synthesize plausible single objects and complex but low-resolution images. Generating high-resolution images directly from complicated text remains a challenge. To this end, we propose the instance mask embedding and attribute-adaptive generative adversarial network (IMEAA-GAN). First, we use a box regression network to compute a global layout containing the class label and location of each instance. Next, the global generator encodes the layout and combines it with the whole text embedding and noise to generate a preliminary low-resolution image; the instance embedding mechanism is then used to guide the local refinement generators to obtain fine-grained local features and generate a more realistic image. Finally, to synthesize exact visual attributes, we introduce a multi-scale attribute-adaptive discriminator, which provides the local refinement generators with specific training signals for explicitly generating instance-level features. Extensive experiments on the MS-COCO dataset and the Caltech-UCSD Birds-200-2011 dataset show that our model obtains globally consistent attributes and generates complex images with local texture details.
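To make the notion of a "global layout containing the class label and location of each instance" concrete, the following is a minimal illustrative sketch in Python. It is not the paper's implementation: the function name `rasterize_layout`, the `(class_id, x0, y0, x1, y1)` box format with normalized coordinates, and the rule that later boxes overwrite earlier ones are all assumptions made here for illustration. It shows only how a set of labeled boxes might be painted onto a grid that a generator could then encode.

```python
from typing import List, Tuple

# Hypothetical box format: (class_id, x0, y0, x1, y1), coordinates normalized to [0, 1].
Box = Tuple[int, float, float, float, float]


def rasterize_layout(boxes: List[Box], height: int, width: int) -> List[List[int]]:
    """Paint each box's class id onto a height x width grid; 0 means background."""
    grid = [[0] * width for _ in range(height)]
    for class_id, x0, y0, x1, y1 in boxes:
        # Convert normalized coordinates to pixel indices, clamped to the grid.
        c0, c1 = max(0, int(x0 * width)), min(width, int(x1 * width))
        r0, r1 = max(0, int(y0 * height)), min(height, int(y1 * height))
        for r in range(r0, r1):
            for c in range(c0, c1):
                grid[r][c] = class_id  # assumption: later boxes overwrite earlier ones
    return grid


# Two overlapping instances on a 4 x 4 grid.
layout = rasterize_layout([(1, 0.0, 0.0, 0.5, 0.5), (2, 0.25, 0.25, 1.0, 1.0)], 4, 4)
for row in layout:
    print(row)
```

In a real model the grid would more likely be a one-hot or embedded tensor per class rather than a single integer map; the integer grid is used here only to keep the sketch dependency-free.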
