Paper Information - Zero-Shot Text-to-Image Generation

Zero-Shot Text-to-Image Generation

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
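The "single stream" idea in the abstract can be illustrated with a minimal sketch. It assumes the text and the image have already been converted to integer tokens by separate tokenizers. The vocabulary sizes, the id-offset scheme, and the function names `build_stream` and `next_token_pairs` are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical toy vocabulary sizes, chosen only to keep the example small.
TEXT_VOCAB_SIZE = 16
IMAGE_VOCAB_SIZE = 8


def build_stream(text_tokens: List[int], image_tokens: List[int]) -> List[int]:
    """Concatenate text and image tokens into one sequence over a shared id space."""
    if any(not 0 <= t < TEXT_VOCAB_SIZE for t in text_tokens):
        raise ValueError("text token out of range")
    if any(not 0 <= t < IMAGE_VOCAB_SIZE for t in image_tokens):
        raise ValueError("image token out of range")
    # Shift image ids past the text ids so the two vocabularies do not collide.
    return list(text_tokens) + [TEXT_VOCAB_SIZE + t for t in image_tokens]


def next_token_pairs(stream: List[int]) -> List[Tuple[List[int], int]]:
    """Autoregressive training pairs: predict token i from tokens 0..i-1."""
    return [(stream[:i], stream[i]) for i in range(1, len(stream))]


stream = build_stream([3, 7, 1], [0, 5, 2, 2])
pairs = next_token_pairs(stream)
```

Because the text tokens come first, every image token is predicted with the whole caption in its context. This is what would let a trained model generate an image from a caption alone: supply the text prefix and sample the remaining positions.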
