Cross-modal Feature Alignment Based Hybrid Attentional Generative Adversarial Networks for Text-to-Image Synthesis

Abstract With the development of generative models, image synthesis has become a research hotspot. This paper presents novel Cross-modal Feature Alignment based Hybrid Attentional Generative Adversarial Networks (CFA-HAGAN) for text-to-image synthesis. The approach consists of two main steps: text-image encoding and text-to-image synthesis. Text-image encoding learns a Cross-modal Feature Alignment Model (CFAM), which adopts a fine-grained attentional network to learn aligned features for the original modalities; the feature alignment space serves as the transitional space of the whole process. The Hybrid Attentional Generative Adversarial Network (HAGAN) then learns the inverse mapping from the encoded text features back to the original image. Specifically, the hybrid attention block combines a text-image cross-modal attention mechanism with an image self-attention mechanism. Cross-modal attention makes the synthesized image fine-grained by adding word-level information as additional supervision, while self-attention addresses the long-distance dependency problem among image sub-region features when images are synthesized from the hidden features. Although GANs perform excellently on a wide range of tasks, they are notoriously difficult to train. By adopting spectral normalization, the discriminators satisfy the 1-Lipschitz constraint, which makes their training more stable than that of the original GAN. In quantitative and qualitative comparisons with many state-of-the-art methods, the experimental results show that the proposed method achieves better performance on evaluation metrics and visual quality. In addition, the experimental section presents attention visualization, an ablation study, and a generalization-ability analysis to demonstrate the effectiveness of the proposed method.
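To make the hybrid attention block concrete, the following is a minimal PyTorch sketch combining word-level cross-modal attention (each image sub-region attends over the word features, in the AttnGAN style) with SAGAN-style image self-attention. The module name HybridAttentionBlock, the single-head formulation, the additive fusion of the two branches, and all layer shapes are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridAttentionBlock(nn.Module):
    """Sketch of a hybrid attention block: word-level cross-modal
    attention plus self-attention over image sub-regions.
    Shapes and fusion strategy are illustrative assumptions."""

    def __init__(self, img_dim, word_dim):
        super().__init__()
        # Project word features into the image feature space so that
        # region-word similarities can be computed.
        self.word_proj = nn.Linear(word_dim, img_dim)
        # 1x1 convolutions producing query/key/value maps for
        # self-attention over image sub-regions (as in SAGAN).
        self.query = nn.Conv2d(img_dim, img_dim // 8, 1)
        self.key = nn.Conv2d(img_dim, img_dim // 8, 1)
        self.value = nn.Conv2d(img_dim, img_dim, 1)
        # Learnable gate on the self-attention residual branch.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, h, words):
        # h: (B, C, H, W) hidden image features
        # words: (B, T, word_dim) word-level text features
        B, C, H, W = h.shape
        regions = h.view(B, C, H * W)                    # (B, C, N)

        # --- cross-modal attention: each region attends to words ---
        w = self.word_proj(words)                        # (B, T, C)
        attn_cm = torch.bmm(w, regions)                  # (B, T, N)
        attn_cm = F.softmax(attn_cm, dim=1)              # normalize over words
        context = torch.bmm(w.transpose(1, 2), attn_cm)  # (B, C, N)

        # --- self-attention over image sub-regions ---
        q = self.query(h).view(B, -1, H * W)             # (B, C//8, N)
        k = self.key(h).view(B, -1, H * W)               # (B, C//8, N)
        v = self.value(h).view(B, -1, H * W)             # (B, C, N)
        attn_sa = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)  # (B, N, N)
        sa = torch.bmm(v, attn_sa.transpose(1, 2))       # (B, C, N)

        # Fuse both attention branches with the original features.
        out = regions + context + self.gamma * sa
        return out.view(B, C, H, W)
```

The zero-initialized gamma gate lets the generator first rely on local features and gradually learn to weight non-local (long-distance) dependencies, which is the standard SAGAN design choice the abstract alludes to.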
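For the stability claim, spectral normalization divides each weight matrix by its largest singular value (estimated via power iteration) so that each layer, and hence the discriminator, is approximately 1-Lipschitz. PyTorch ships this as torch.nn.utils.spectral_norm; the sketch below shows how discriminator layers could be wrapped. The helper name sn_conv, the depth, and the channel widths are hypothetical, not the paper's architecture.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch, k=4, s=2, p=1):
    """Strided conv wrapped with spectral normalization, keeping the
    layer's spectral norm (largest singular value) near 1."""
    return spectral_norm(nn.Conv2d(in_ch, out_ch, k, stride=s, padding=p))

# Illustrative discriminator trunk with spectrally normalized layers.
discriminator = nn.Sequential(
    sn_conv(3, 64),
    nn.LeakyReLU(0.2, inplace=True),
    sn_conv(64, 128),
    nn.LeakyReLU(0.2, inplace=True),
    sn_conv(128, 256),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 1, 4),  # patch-level real/fake logits
)
```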
