CCVS: Context-aware Controllable Video Synthesis

This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones, with several new key elements for improved spatial resolution and realism: it conditions the synthesis process on contextual information for temporal continuity and on ancillary information for fine control. The prediction model is doubly autoregressive: in the latent space of an autoencoder for forecasting, and in image space for updating contextual information, which is also used to enforce spatio-temporal consistency through a learnable optical flow module. Adversarial training of the autoencoder in both the appearance and temporal domains further improves the realism of its output. A quantizer inserted between the encoder and the transformer in charge of forecasting future frames in latent space (and its inverse inserted between the transformer and the decoder) adds flexibility: it affords simple mechanisms for handling multimodal ancillary information that controls the synthesis process (e.g., a few sample frames, an audio track, a trajectory in image space), and it accounts for the intrinsically uncertain nature of the future by allowing multiple predictions. Experiments with an implementation of the proposed approach give very good qualitative and quantitative results on multiple tasks and standard benchmarks.
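To make the architecture described above concrete, the sketch below shows one plausible reading of the core pipeline: frames are encoded into a latent grid, quantized into discrete tokens via a codebook with a straight-through gradient estimator, a causally masked transformer operates autoregressively over the token sequence, and a decoder maps latents back to image space. This is a minimal illustrative assumption, not the authors' implementation: all names (VectorQuantizer, CCVSSketch) and hyperparameters are hypothetical, and the optical-flow module, adversarial losses, and image-space context updates are omitted.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient estimator."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, N, dim)
        w = self.codebook.weight
        # Squared distances to every code: ||z||^2 - 2 z.w + ||w||^2.
        dists = (z.pow(2).sum(-1, keepdim=True)
                 - 2 * z @ w.t()
                 + w.pow(2).sum(-1))
        idx = dists.argmin(dim=-1)                          # (B, N) discrete token ids
        zq = self.codebook(idx)                             # quantized latents
        zq = z + (zq - z).detach()                          # straight-through: gradients flow to z
        return zq, idx

class CCVSSketch(nn.Module):
    def __init__(self, dim=64, num_codes=512):
        super().__init__()
        self.encoder = nn.Sequential(                       # frame -> 4x4 latent grid
            nn.Conv2d(3, dim, 4, stride=4), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=4))
        self.quantizer = VectorQuantizer(num_codes, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Sequential(                       # latent grid -> frame
            nn.ConvTranspose2d(dim, dim, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=4))

    def forward(self, frames):                              # frames: (B, T, 3, 64, 64)
        B, T, C, H, W = frames.shape
        z = self.encoder(frames.flatten(0, 1))              # (B*T, dim, h, w)
        h, w = z.shape[-2:]
        tokens = z.flatten(2).transpose(1, 2)               # (B*T, h*w, dim)
        zq, _ = self.quantizer(tokens)
        zq = zq.reshape(B, T * h * w, -1)                   # one token sequence per clip
        mask = nn.Transformer.generate_square_subsequent_mask(zq.size(1))
        out = self.transformer(zq, mask=mask)               # causal attention over past tokens
        out = out.reshape(B * T, h, w, -1).permute(0, 3, 1, 2)
        return self.decoder(out).reshape(B, T, C, H, W)

model = CCVSSketch()
clip = torch.randn(2, 4, 3, 64, 64)                        # two clips of four 64x64 frames
print(model(clip).shape)                                   # torch.Size([2, 4, 3, 64, 64])
```

The straight-through estimator (copying gradients across the non-differentiable argmin) is the standard way to train such a quantizer end to end, and the discrete token interface is what makes it natural to splice in extra condition tokens (sample frames, audio, trajectories) and to sample multiple futures from the transformer.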
