Stochastic Adversarial Video Prediction

Being able to predict what may happen in the future requires an in-depth understanding of the physical and causal rules that govern the world. A model that can do so has a number of appealing applications, from robotic planning to representation learning. However, learning to predict raw future observations, such as frames in a video, is exceedingly challenging: the ambiguous nature of the problem can cause a naively designed model to average possible futures into a single, blurry prediction. Recently, this has been addressed by two distinct approaches: (a) variational latent variable models that explicitly model the underlying stochasticity and (b) adversarially trained models that aim to produce naturalistic images. However, a standard latent variable model can struggle to produce realistic results, while a standard adversarially trained model underutilizes its latent variables and fails to produce diverse predictions. We show that these two methods are in fact complementary: combining them produces predictions that look more realistic to human raters and better cover the range of possible futures. Our method outperforms prior and concurrent work on both of these criteria.
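The abstract does not spell out the objective, but the combination it describes (a variational latent code trained with a KL-regularized reconstruction loss, plus an adversarial loss that pushes samples toward realistic imagery) can be sketched roughly as below. This is a minimal illustration only: the module names (`encoder`, `generator`, `discriminator`), their signatures, and the loss weights are hypothetical placeholders, not the paper's actual architecture.

```python
# Sketch of a combined variational/adversarial video-prediction objective,
# assuming PyTorch-style modules. All module interfaces and weights here are
# illustrative assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def combined_generator_loss(generator, encoder, discriminator,
                            context, target, kl_weight=0.1, gan_weight=1.0):
    # Variational part: infer an approximate posterior q(z | context, target)
    # over the latent, sample it with the reparameterization trick, and
    # reconstruct the true future frames from that sample.
    mu, logvar = encoder(context, target)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    prediction = generator(context, z)

    recon = F.l1_loss(prediction, target)
    # KL(q(z|x) || N(0, I)), keeping the posterior close to the prior so that
    # sampling z ~ N(0, I) at test time yields plausible, diverse futures.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # Adversarial part: predictions generated from prior samples should fool
    # the discriminator, pushing outputs toward realistic, non-blurry frames.
    z_prior = torch.randn_like(mu)
    fake_logits = discriminator(generator(context, z_prior))
    gan = F.binary_cross_entropy_with_logits(fake_logits,
                                             torch.ones_like(fake_logits))

    # The discriminator is trained separately on real vs. generated frames,
    # as in a standard GAN; that step is omitted here.
    return recon + kl_weight * kl + gan_weight * gan
```

One plausible reading of the design is that the reconstruction and KL terms keep the latent variable informative (avoiding the latent underutilization the abstract attributes to pure GANs), while the adversarial term on prior samples keeps the outputs sharp (avoiding the blur the abstract attributes to pure variational models). The actual method may also apply the adversarial loss to posterior samples; the split above is just one reasonable arrangement.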
