NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

In this paper, we present NUWA-Infinity, a generative model for infinite visual synthesis, defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos. We propose an autoregressive over autoregressive generation mechanism for this variable-size generation task: a global patch-level autoregressive model captures dependencies between patches, while a local token-level autoregressive model captures dependencies between visual tokens within each patch. A Nearby Context Pool (NCP) caches related patches that have already been generated as context for the patch currently being generated, which significantly reduces computation cost without sacrificing patch-level dependency modeling. An Arbitrary Direction Controller (ADC) decides suitable generation orders for different visual synthesis tasks and learns order-aware positional embeddings. Compared to DALL·E, Imagen, and Parti, NUWA-Infinity can generate high-resolution images of arbitrary size and additionally supports long-duration video generation. Compared to NUWA, which also covers images and videos, NUWA-Infinity has superior visual synthesis capabilities in terms of resolution and variable-size generation. The GitHub link is https://github.com/microsoft/NUWA . The homepage link is https://nuwa-infinity.microsoft.com .
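The nested generation scheme described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the patch grid, `token_model` callable, and the distance-based eviction rule for the Nearby Context Pool are all hypothetical stand-ins, shown only to make the autoregressive-over-autoregressive control flow and the bounded-context caching concrete.

```python
# Sketch of "autoregressive over autoregressive" generation with a
# Nearby Context Pool (NCP). All names here are illustrative assumptions.
from collections import OrderedDict

def nearby(done_pos, cur_pos, radius=1):
    """Keep only already-generated patches within `radius` of cur_pos."""
    (r, c), (cr, cc) = done_pos, cur_pos
    return abs(r - cr) <= radius and abs(c - cc) <= radius

def generate(grid_rows, grid_cols, tokens_per_patch, token_model, radius=1):
    ncp = OrderedDict()  # Nearby Context Pool: position -> patch tokens
    image = {}
    for r in range(grid_rows):                # global, patch-level autoregression
        for c in range(grid_cols):
            context = list(ncp.values())      # attend only to cached nearby patches
            patch = []
            for _ in range(tokens_per_patch): # local, token-level autoregression
                patch.append(token_model(context, patch))
            image[(r, c)] = patch
            ncp[(r, c)] = patch
            # Evict patches no longer near the current position, bounding
            # the context size (and thus attention cost) per patch.
            for pos in list(ncp):
                if not nearby(pos, (r, c), radius):
                    del ncp[pos]
    return image
```

Because eviction keeps the pool size bounded by the neighborhood radius rather than the full image, the per-patch cost stays constant as the image grows, which is what enables arbitrarily-sized outputs.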
