StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN

Generative adversarial networks (GANs) continue to produce advances in the visual quality of still images, as well as in the learning of temporal correlations. However, few works manage to combine these two capabilities for the synthesis of video content: most methods require an extensive training dataset to learn temporal correlations, while being rather limited in the resolution and visual quality of their output. We present a novel approach to the video synthesis problem that greatly improves visual quality and drastically reduces the amount of training data and resources necessary for generating videos. Our formulation separates the spatial domain, in which individual frames are synthesized, from the temporal domain, in which motion is generated. For the spatial domain we use a pre-trained StyleGAN network, whose latent space allows control over the appearance of the objects it was trained for. The expressive power of this model allows us to embed our training videos in the StyleGAN latent space. Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes. The advantageous properties of the StyleGAN space simplify the discovery of temporal correlations. We demonstrate that it suffices to train our temporal architecture on only 10 minutes of footage of a single subject for about 6 hours. After training, our model can generate new portrait videos not only for the training subject, but also for any arbitrary subject that can be embedded in the StyleGAN space.
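The core idea of the abstract — embed each video frame as a latent code, then model motion as a trajectory in latent space — can be sketched as follows. This is a minimal illustration, not the paper's architecture: `embed_frame` is a hypothetical stand-in for a StyleGAN encoder/inversion step, the latent dimension is reduced from StyleGAN's 512-D W space, and a simple linear autoregressive predictor replaces the paper's learned temporal network.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8  # StyleGAN's W space is 512-D; reduced here for illustration
T = 200         # number of frames in the "training video"

def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for GAN inversion: map an RGB frame to a latent code."""
    return frame.reshape(-1)[:LATENT_DIM].astype(np.float64)

# Synthetic "video": frames drifting smoothly over time (shape T x H x W x C).
frames = np.cumsum(rng.normal(scale=0.1, size=(T, 4, 4, 3)), axis=0)

# Step 1: embed every frame into latent space -> a trajectory of latent codes.
w = np.stack([embed_frame(f) for f in frames])  # shape (T, LATENT_DIM)

# Step 2: train a temporal model on latent sequences. Here, a one-step
# linear predictor w[t+1] ~ w[t] @ A, fitted by least squares.
A, *_ = np.linalg.lstsq(w[:-1], w[1:], rcond=None)

# Step 3: autoregressive rollout generates a new latent trajectory from a
# single seed code; each latent would then be decoded back to an RGB frame
# by the frozen, pre-trained StyleGAN generator (not included here).
rollout = [w[0]]
for _ in range(T - 1):
    rollout.append(rollout[-1] @ A)
rollout = np.stack(rollout)

err = np.mean(np.abs(w[1:] - w[:-1] @ A))  # one-step prediction error
print(rollout.shape, err)
```

The point of the separation is visible even in this toy: the temporal model never sees pixels, so it operates on compact, well-structured codes, which is why the paper can learn motion from only minutes of footage.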
