Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks

We study the problem of synthesizing a number of likely future frames from a single input image. In contrast to traditional methods, which have tackled this problem in a deterministic or non-parametric way, we propose a novel approach that models future frames probabilistically. Our probabilistic model lets us sample and synthesize many possible future frames from a single input image. Future frame synthesis is challenging, as it involves both low- and high-level image and motion understanding. To aid in synthesizing future frames, we propose a novel network structure, the Cross Convolutional Network, which encodes image information as feature maps and motion information as convolutional kernels. In experiments, our model performs well on synthetic data, such as 2D shapes and animated game sprites, as well as on real-world videos. We also show that our model can be applied to tasks such as visual analogy-making, and we present an analysis of the learned network representations.
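To make the idea of encoding image content as feature maps and motion as convolutional kernels concrete, here is a minimal NumPy sketch of a per-channel "cross convolution". It is an illustration under stated assumptions, not the network described in the paper: the function name `cross_convolve`, the array shapes, the zero padding, and the hand-written kernel are all hypothetical, and a real model would predict both inputs with learned encoders.

```python
import numpy as np

def cross_convolve(feature_maps, kernels):
    """Convolve each feature map with its own kernel (zero padding, same size).

    feature_maps: (C, H, W) array standing in for image features.
    kernels:      (C, k, k) array with odd k standing in for motion kernels.
    Both shapes are assumptions made for this sketch.
    """
    C, H, W = feature_maps.shape
    Ck, k, k2 = kernels.shape
    assert C == Ck and k == k2 and k % 2 == 1
    pad = k // 2
    padded = np.pad(feature_maps, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(feature_maps.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            # Flip the kernel indices so this is a true convolution,
            # not a cross-correlation.
            w = kernels[:, k - 1 - dy, k - 1 - dx][:, None, None]
            out += w * padded[:, dy:dy + H, dx:dx + W]
    return out

# One 3x3 feature map with a single active pixel in the centre.
features = np.zeros((1, 3, 3))
features[0, 1, 1] = 1.0

# A hypothetical "motion" kernel: a delta one step right of centre,
# which moves the content one pixel to the right.
kernel = np.zeros((1, 3, 3))
kernel[0, 1, 2] = 1.0

moved = cross_convolve(features, kernel)
```

In this toy setting, swapping in a different kernel for the same feature map produces a different shifted output, which loosely mirrors how different sampled motions could yield different future frames from one input image.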
