CogView: Mastering Text-to-Image Generation via Transformers

Text-to-image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g., style learning, super-resolution, text-image ranking, and fashion design, as well as methods to stabilize pretraining, e.g., eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and DALL-E, a recent work with a similar approach.

[Figure 1: Samples generated by CogView. The prompts in the first row (e.g., "A tiger is playing football", "A coffee cup printed with a cat. Sky background", "A Big Ben clock towering over the city of London") are either from MS COCO (outside our training set) or user queries on our demo website. The images in the second row are finetuned results for different styles (Chinese traditional drawing, oil painting, cartoon, sketch) or super-resolution. The actual input text is in Chinese, translated into English here for better understanding. More samples for captions from MS COCO are included in Appendix F.]
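The core pipeline described above, discretizing an image into VQ-VAE codebook indices and then modeling text and image tokens as one left-to-right sequence, can be sketched as follows. This is a minimal illustration with toy sizes and hypothetical names (`vq_tokenize`, `build_sequence`, a 512-entry codebook); the actual model uses a much larger codebook and a 4-billion-parameter Transformer:

```python
import numpy as np

def vq_tokenize(patches, codebook):
    # Nearest-neighbor quantization: each patch embedding is mapped to the
    # index of its closest codebook entry -- the discrete "image token".
    d = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def build_sequence(text_ids, image_ids, boi_id):
    # Input to the autoregressive Transformer: text tokens, a
    # begin-of-image separator, then the flattened image tokens.
    return np.concatenate([text_ids, [boi_id], image_ids])

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # toy codebook: 512 entries, dim 64
patches = rng.normal(size=(64, 64))     # toy 8x8 grid of patch embeddings
image_ids = vq_tokenize(patches, codebook)
seq = build_sequence(np.array([5, 17, 42]), image_ids, boi_id=512)
print(seq.shape)  # (68,) = 3 text tokens + 1 separator + 64 image tokens
```

At generation time the Transformer would be conditioned on the text prefix and sample image tokens one by one, after which the VQ-VAE decoder maps the token grid back to pixels.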
