Vector-quantized Image Modeling with Improved VQGAN

Pretraining language models with next-token prediction on massive text corpora has delivered phenomenal zero-shot, few-shot, transfer-learning, and multi-tasking capabilities on both generative and discriminative language tasks. Motivated by this success, we explore a Vector-quantized Image Modeling (VIM) approach that involves pretraining a Transformer to predict rasterized image tokens autoregressively. The discrete image tokens are encoded from a learned Vision-Transformer-based VQGAN (ViT-VQGAN). We first propose multiple improvements over the vanilla VQGAN, from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional and class-conditioned image generation as well as unsupervised representation learning. When trained on ImageNet at 256 × 256 resolution, we achieve an Inception Score (IS) of 175.1 and a Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively. Based on ViT-VQGAN and unsupervised pretraining, we further evaluate the pretrained Transformer by averaging intermediate features, similar to Image GPT (iGPT). This ImageNet-pretrained VIM-L significantly outperforms iGPT-L, improving linear-probe accuracy from 60.3% to 72.2% at a similar model size. VIM-L also outperforms iGPT-XL, which is trained with extra web image data and a larger model size.
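To make the two-stage pipeline concrete, the sketch below shows (1) nearest-neighbour vector quantization of encoder latents into discrete image tokens with a straight-through gradient, and (2) a causal Transformer trained with next-token prediction over the rasterized token sequence. This is a minimal illustrative sketch in PyTorch, not the paper's implementation: all class names, sizes, and hyperparameters are assumptions, and it omits the ViT encoder/decoder as well as the adversarial and perceptual losses used to train the VQGAN.

```python
# Minimal sketch of the two-stage VIM pipeline (illustrative assumptions only).
# Stage 1: encoder latents are quantized against a learned codebook
#          (nearest-neighbour lookup with a straight-through gradient).
# Stage 2: a causal Transformer is trained with next-token prediction over the
#          rasterized sequence of codebook indices.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""

    def __init__(self, codebook_size=8192, dim=32, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta

    def forward(self, z):  # z: (batch, tokens, dim) encoder outputs
        # Squared distances to every codebook entry, then pick the closest one.
        d = (z ** 2).sum(-1, keepdim=True) \
            - 2 * z @ self.codebook.weight.t() \
            + (self.codebook.weight ** 2).sum(-1)
        idx = d.argmin(dim=-1)                    # discrete image tokens
        z_q = self.codebook(idx)                  # quantized latents
        # Codebook and commitment losses (VQ-VAE style), straight-through gradient.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss


class TokenTransformer(nn.Module):
    """Decoder-only Transformer predicting the next image token."""

    def __init__(self, vocab=8192, d_model=512, layers=8, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, idx):  # idx: (batch, seq_len) token ids
        t = idx.size(1)
        x = self.tok(idx) + self.pos(torch.arange(t, device=idx.device))
        causal = torch.triu(torch.full((t, t), float('-inf'), device=idx.device), 1)
        h = self.blocks(x, mask=causal)           # causal self-attention
        return self.head(h)                       # logits over the codebook


# Usage: quantize patch latents, then train the Transformer with a shifted
# cross-entropy loss (next-token prediction) over the rasterized token grid.
vq, lm = VectorQuantizer(), TokenTransformer()
z = torch.randn(2, 1024, 32)        # stand-in for a 32x32 grid of encoder latents
_, tokens, vq_loss = vq(z)
logits = lm(tokens[:, :-1])
lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```

For linear-probe evaluation in the style described above, intermediate Transformer features would be averaged over the token sequence and fed to a linear classifier; that step is not shown here.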
