Zero-Shot Text-to-Image Generation

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.

[1]  Ruslan Salakhutdinov,et al.  Generating Images from Captions with Attention , 2015, ICLR.

[2]  Rico Sennrich,et al.  Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[3]  Yoshua Bengio,et al.  Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[5]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[6]  Ilya Sutskever,et al.  Generating Long Sequences with Sparse Transformers , 2019, ArXiv.

[7]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[8]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[9]  Hao Wu,et al.  Mixed Precision Training , 2017, ICLR.

[10]  David A. Shamma,et al.  YFCC100M , 2015, Commun. ACM.

[11]  Swagath Venkataramani,et al.  Ultra-Low Precision 4-bit Training of Deep Neural Networks , 2020, NeurIPS.

[12]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[13]  Daan Wierstra,et al.  Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[14]  Xiaogang Wang,et al.  StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[16]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[17]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[18]  Yoshua Bengio,et al.  Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation , 2013, ArXiv.

[19]  Xi Chen,et al.  PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications , 2017, ICLR.

[20]  Ilya Sutskever,et al.  Jukebox: A Generative Model for Music , 2020, ArXiv.

[21]  Ali Razavi,et al.  Generating Diverse High-Fidelity Images with VQ-VAE-2 , 2019, NeurIPS.

[22]  Bernt Schiele,et al.  Generative Adversarial Text to Image Synthesis , 2016, ICML.

[23]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[24]  Samyam Rajbhandari,et al.  ZeRO: Memory Optimization Towards Training A Trillion Parameter Models , 2019, ArXiv.

[25]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[26]  Jian Sun,et al.  Identity Mappings in Deep Residual Networks , 2016, ECCV.

[27]  Jiawei Han,et al.  Understanding the Difficulty of Training Transformers , 2020, EMNLP.

[28]  Chen Sun,et al.  Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Honglak Lee,et al.  Text-to-Image Generation Grounded by Fine-Grained User Attention , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[30]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Zhe Gan,et al.  AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[33]  Geoffrey E. Hinton Tensor Product Variable Binding and the Representation of Symbolic Structures in Connectionist Systems , 1991 .

[34]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[35]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[36]  Christopher Burgess,et al.  beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework , 2016, ICLR 2016.

[37]  Wei Chen,et al.  DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-To-Image Synthesis , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  N. Sebe,et al.  DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis , 2020, ArXiv.

[39]  Martin Jaggi,et al.  PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization , 2019, NeurIPS.

[40]  Radu Soricut,et al.  Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning , 2018, ACL.

[41]  Mark Chen,et al.  Generative Pretraining From Pixels , 2020, ICML.

[42]  Bernt Schiele,et al.  Learning What and Where to Draw , 2016, NIPS.

[43]  Xin Wang,et al.  Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks , 2017, NIPS.

[44]  Ivan Provilkov,et al.  BPE-Dropout: Simple and Effective Subword Regularization , 2019, ACL.

[45]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[46]  Klaus Greff,et al.  On the Binding Problem in Artificial Neural Networks , 2020, ArXiv.

[47]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[48]  Jiasen Lu,et al.  X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers , 2020, EMNLP.

[49]  Dan Klein,et al.  Learning with Latent Language , 2017, NAACL.

[50]  Dimitris N. Metaxas,et al.  StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[51]  Simon Haykin,et al.  GradientBased Learning Applied to Document Recognition , 2001 .

[52]  Oriol Vinyals,et al.  Neural Discrete Representation Learning , 2017, NIPS.

[53]  Lei Zhang,et al.  Object-Driven Text-To-Image Synthesis via Adversarial Training , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).