Paper Information - Zero-Shot Text-to-Image Generation

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
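The "single stream" idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the vocabulary sizes, the helper names `build_stream` and `stream_nll`, and the uniform stand-in for the transformer are all assumptions made for illustration. The sketch shifts image token ids so that text and image tokens share one vocabulary, concatenates them, and scores the result with the autoregressive factorization p(x) = prod_i p(x_i | x_<i).

```python
import math
from typing import Callable, List, Sequence

# Hypothetical sizes chosen for illustration only; the abstract gives none.
TEXT_VOCAB_SIZE = 16
IMAGE_VOCAB_SIZE = 8


def build_stream(text_tokens: Sequence[int], image_tokens: Sequence[int]) -> List[int]:
    """Concatenate text and image tokens into one sequence over a shared vocabulary.

    Image token ids are shifted by TEXT_VOCAB_SIZE so the two id ranges do not collide.
    """
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_tokens):
        raise ValueError("text token out of range")
    if any(not 0 <= t < IMAGE_VOCAB_SIZE for t in image_tokens):
        raise ValueError("image token out of range")
    return list(text_tokens) + [TEXT_VOCAB_SIZE + t for t in image_tokens]


def stream_nll(
    stream: Sequence[int],
    next_token_probs: Callable[[Sequence[int]], Sequence[float]],
) -> float:
    """Negative log-likelihood of the stream, summed over next-token predictions."""
    total = 0.0
    for i, target in enumerate(stream):
        # The model only sees tokens before position i (the autoregressive prefix).
        probs = next_token_probs(stream[:i])
        total -= math.log(probs[target])
    return total


def uniform_model(prefix: Sequence[int]) -> List[float]:
    """Stand-in for a transformer: a uniform distribution over the joint vocabulary."""
    vocab = TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE
    return [1.0 / vocab] * vocab


# Three text tokens followed by four image tokens form one training sequence.
stream = build_stream([3, 7, 1], [0, 5, 2, 2])
nll = stream_nll(stream, uniform_model)
```

In this toy setup a trained model would replace `uniform_model`. Generating an image from a caption would then amount to fixing the text prefix and sampling the image tokens one at a time.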
