Paper Information - A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction

A Hierarchical Variational Neural Uncertainty Model for Stochastic Video Prediction

Predicting the future frames of a video is a challenging task, in part because of the underlying stochastic real-world phenomena. Prior approaches to this task typically estimate a latent prior characterizing this stochasticity, but do not account for the predictive uncertainty of the (deep learning) model. Such approaches often derive the training signal from the mean-squared error (MSE) between the generated frame and the ground truth, which can lead to sub-optimal training, especially when the predictive uncertainty is high. To this end, we introduce the Neural Uncertainty Quantifier (NUQ), a stochastic quantification of the model's predictive uncertainty, and use it to weight the MSE loss. We propose a hierarchical, variational framework to derive NUQ in a principled manner using a deep, Bayesian graphical model. Our experiments on three benchmark stochastic video prediction datasets show that our proposed framework trains more effectively than state-of-the-art models (especially when the training sets are small), while demonstrating better video generation quality and diversity across several evaluation metrics.
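The idea of weighting the MSE by an uncertainty estimate can be illustrated with a minimal sketch. It assumes a Gaussian likelihood with a single scalar precision (inverse variance) per frame, which is a standard textbook form and not necessarily the objective used in the paper. The function name `precision_weighted_mse` and its arguments are hypothetical, and here the precision is passed in as a plain number rather than sampled from a learned distribution.

```python
import math


def precision_weighted_mse(pred, target, precision):
    """Gaussian negative log-likelihood, up to an additive constant.

    Assumption for illustration only: one scalar precision (inverse
    variance) scales the summed squared error of a flattened frame.
    """
    if precision <= 0:
        raise ValueError("precision must be positive")
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    # Summed squared error between predicted and ground-truth pixels.
    sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))
    n = len(pred)

    # The first term down-weights the error when precision is low
    # (uncertainty is high); the log term penalizes claiming low
    # precision everywhere.
    return 0.5 * precision * sq_err - 0.5 * n * math.log(precision)


if __name__ == "__main__":
    pred = [0.0, 0.0, 0.0, 0.0]
    target = [1.0, 1.0, 1.0, 1.0]
    for precision in (0.5, 1.0, 2.0):
        print(precision, precision_weighted_mse(pred, target, precision))
```

Under this assumed form, the loss for a fixed error is smallest when the precision equals the number of pixels divided by the summed squared error, so a model that predicts its own precision is pushed to report low precision on frames it reconstructs poorly.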
