Paper information - Diffusion Models for Video Prediction and Infilling

Predicting and anticipating future outcomes or reasoning about missing information in a sequence are critical skills for agents to be able to make intelligent decisions. This requires strong, temporally coherent generative capabilities. Diffusion models have shown remarkable success in several generative tasks, but have not been extensively explored in the video domain. We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions, and introduces a new conditioning technique during training. By varying the mask we condition on, the model is able to perform video prediction, infilling, and upsampling. Due to our simple conditioning scheme, we can utilize the same architecture as used for unconditional training, which allows us to train the model in a conditional and unconditional fashion at the same time. We evaluate RaMViD on two benchmark datasets for video prediction, on which we achieve state-of-the-art results, and one for video generation. High-resolution videos are provided at https://sites.google.com/view/video-diffusion-prediction.
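The abstract does not spell out the masking procedure, so the following is only a minimal sketch of how random-mask conditioning could work, under stated assumptions: a random subset of frames is kept clean as conditioning, only the remaining frames are noised, and the loss is computed on the noised frames alone. All names here (`sample_condition_mask`, `task_mask`, `noise_unmasked_frames`, `masked_mse`, `p_uncond`, `max_cond`) are hypothetical and not taken from the paper or its code, and the denoising network itself is omitted.

```python
import numpy as np

def sample_condition_mask(num_frames, max_cond, p_uncond, rng):
    """Training-time mask over frames; True marks a conditioning frame.

    Assumption: with probability p_uncond no frame is given, which is what
    would let one model be trained conditionally and unconditionally at once.
    """
    mask = np.zeros(num_frames, dtype=bool)
    if rng.random() < p_uncond:
        return mask  # unconditional sample: every frame is noised
    k = rng.integers(1, max_cond + 1)  # number of conditioning frames
    mask[rng.choice(num_frames, size=k, replace=False)] = True
    return mask

def task_mask(num_frames, task, k=2):
    """Test-time masks: the choice of mask selects the task (illustrative)."""
    mask = np.zeros(num_frames, dtype=bool)
    if task == "prediction":      # first k frames are known
        mask[:k] = True
    elif task == "infilling":     # first and last k frames are known
        mask[:k] = True
        mask[-k:] = True
    elif task == "upsampling":    # every k-th frame is known
        mask[::k] = True
    else:
        raise ValueError(f"unknown task: {task}")
    return mask

def noise_unmasked_frames(video, mask, alpha_bar, rng):
    """Apply x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps to non-conditioning frames.

    video has shape (T, H, W, C); conditioning frames are passed through clean.
    """
    eps = rng.standard_normal(video.shape)
    noisy = np.sqrt(alpha_bar) * video + np.sqrt(1.0 - alpha_bar) * eps
    frame_mask = mask[:, None, None, None]  # broadcast over H, W, C
    return np.where(frame_mask, video, noisy), eps

def masked_mse(pred_eps, eps, mask):
    """Noise-prediction loss restricted to the frames that were noised."""
    unknown = ~mask
    if not unknown.any():
        return 0.0
    return float(np.mean((pred_eps[unknown] - eps[unknown]) ** 2))
```

Because the conditioning frames occupy the same tensor slots as the frames being generated, a scheme like this needs no extra input channels, which is consistent with the abstract's remark that the architecture is the same as for unconditional training.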
