论文信息 - Improved Conditional VRNNs for Video Prediction

Improved Conditional VRNNs for Video Prediction

Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder. While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions in order to better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics in three different datasets.

Aaron C. Courville | Nicolas Ballas | Lluís Castrejón | Nicolas Ballas | Lluís Castrejón

[1] Antonio Torralba,et al. Generating Videos with Scene Dynamics , 2016, NIPS.

[2] Max Welling,et al. Improved Variational Inference with Inverse Autoregressive Flow , 2016, NIPS 2016.

[3] Marc'Aurelio Ranzato,et al. Video (language) modeling: a baseline for generative models of natural videos , 2014, ArXiv.

[4] Honglak Lee,et al. Action-Conditional Video Prediction using Deep Networks in Atari Games , 2015, NIPS.

[5] Nitish Srivastava,et al. Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[6] Petros Koumoutsakos,et al. ContextVP: Fully Context-Aware Video Prediction , 2017, ECCV.

[7] Ole Winther,et al. How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks , 2016, ICML 2016.

[8] Kaiming He,et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[9] Ole Winther,et al. Ladder Variational Autoencoders , 2016, NIPS.

[10] Dit-Yan Yeung,et al. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[11] Sergio Gomez Colmenarejo,et al. Parallel Multiscale Autoregressive Density Estimation , 2017, ICML.

[12] Michael I. Jordan,et al. An Introduction to Variational Methods for Graphical Models , 1999, Machine-mediated learning.

[13] Alexei A. Efros,et al. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[15] Sergey Levine,et al. Self-Supervised Visual Planning with Temporal Skip Connections , 2017, CoRL.

[16] Sergey Levine,et al. VideoFlow: A Flow-Based Generative Model for Video , 2019, ArXiv.

[17] Dustin Tran,et al. Hierarchical Variational Models , 2015, ICML.

[18] Yoshua Bengio,et al. A Recurrent Latent Variable Model for Sequential Data , 2015, NIPS.

[19] Ole Winther,et al. Autoencoding beyond pixels using a learned similarity metric , 2015, ICML.

[20] Kilian Q. Weinberger,et al. Densely Connected Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Nikos Komodakis,et al. Unsupervised Representation Learning by Predicting Image Rotations , 2018, ICLR.

[22] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Sjoerd van Steenkiste,et al. Towards Accurate Generative Models of Video: A New Metric & Challenges , 2018, ArXiv.

[24] Eero P. Simoncelli,et al. Image quality assessment: from error visibility to structural similarity , 2004, IEEE Transactions on Image Processing.

[25] Ronald J. Williams,et al. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks , 1989, Neural Computation.

[26] Sergey Levine,et al. Stochastic Adversarial Video Prediction , 2018, ArXiv.

[27] Alex Graves,et al. Video Pixel Networks , 2016, ICML.

[28] Max Welling,et al. Auto-Encoding Variational Bayes , 2013, ICLR.

[29] Yann LeCun,et al. Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[30] Seunghoon Hong,et al. Decomposing Motion and Content for Natural Video Sequence Prediction , 2017, ICLR.

[31] Jan Kautz,et al. MoCoGAN: Decomposing Motion and Content for Video Generation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[33] Rob Fergus,et al. Stochastic Video Generation with a Learned Prior , 2018, ICML.

[34] Yoshua Bengio,et al. Generative Adversarial Nets , 2014, NIPS.

[35] Antonio Torralba,et al. Anticipating Visual Representations from Unlabeled Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Gabriel Kreiman,et al. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning , 2016, ICLR.

[37] Ole Winther,et al. BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling , 2019, NeurIPS.

[38] Sergey Levine,et al. Stochastic Variational Video Prediction , 2017, ICLR.

[39] Sebastian Ramos,et al. The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Daan Wierstra,et al. Stochastic Backpropagation and Approximate Inference in Deep Generative Models , 2014, ICML.

[41] Ole Winther,et al. Auxiliary Deep Generative Models , 2016, ICML.

[42] Vighnesh Birodkar,et al. Unsupervised Learning of Disentangled Representations from Video , 2017, NIPS.