TIME: Text and Image Mutual-Translation Adversarial Networks

Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight yet effective model that jointly learns a T2I generator $G$ and an image-captioning discriminator $D$ under the Generative Adversarial Network framework. Whereas previous methods treat T2I as a unidirectional task and rely on pre-trained language models to enforce image-text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of $G$ can be boosted substantially by training it jointly with $D$ as a language model. Specifically, we adopt Transformers to model the cross-modal connections between image features and word embeddings, and design a hinged, annealed conditional loss that dynamically balances the adversarial learning. In our experiments, TIME establishes a new state-of-the-art Inception Score of 4.88 on the CUB dataset and shows competitive performance on MS-COCO on both the text-to-image and image-captioning tasks.
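To make the "hinged, annealed conditional loss" concrete, the sketch below shows one common way such a discriminator objective can be structured: a hinge loss on the unconditional real/fake logits plus a conditional (text-matching) hinge term whose weight is annealed over training. The function names, the linear schedule, and the exact combination are illustrative assumptions, not the paper's precise formulation.

```python
def hinge(x):
    """Hinge rectifier: max(0, x)."""
    return max(0.0, x)

def d_hinge_loss(real_logit, fake_logit, cond_real_logit, cond_fake_logit, alpha):
    """Discriminator loss: unconditional hinge term plus an
    alpha-weighted conditional (image-text matching) hinge term."""
    uncond = hinge(1.0 - real_logit) + hinge(1.0 + fake_logit)
    cond = hinge(1.0 - cond_real_logit) + hinge(1.0 + cond_fake_logit)
    return uncond + alpha * cond

def annealed_alpha(step, total_steps, alpha_max=1.0):
    """Hypothetical linear annealing schedule for the conditional weight."""
    return alpha_max * min(1.0, step / total_steps)
```

With confident logits (real above +1, fake below -1) both hinge terms vanish and the loss is zero; early in training a small `alpha` lets the unconditional realism term dominate before the conditional term is phased in.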
