Learning Temporal Transformations from Time-Lapse Videos

Based on life-long observations of physical, chemical, and biologic phenomena in the natural world, humans can often easily picture in their minds what an object will look like in the future. But, what about computers? In this paper, we learn computational models of object transformations from time-lapse videos. In particular, we explore the use of generative models to create depictions of objects at future times. These models explore several different prediction tasks: generating a future state given a single depiction of an object, generating a future state given two depictions of an object at different times, and generating future states recursively in a recurrent framework. We provide both qualitative and quantitative evaluations of the generated results, and also conduct a human evaluation to compare variations of our models.

[1]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[2]  Ali Farhadi,et al.  Describing objects by their attributes , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[5]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[6]  Steven M. Seitz,et al.  3D Time-Lapse Reconstruction from Internet Photos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Rob Fergus,et al.  Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks , 2015, NIPS.

[8]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[9]  Ruslan Salakhutdinov,et al.  Learning Stochastic Feedforward Neural Networks , 2013, NIPS.

[10]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[11]  Antonio Torralba,et al.  Anticipating the future by watching unlabeled video , 2015, ArXiv.

[12]  Marc'Aurelio Ranzato,et al.  Video (language) modeling: a baseline for generative models of natural videos , 2014, ArXiv.

[13]  Steven M. Seitz,et al.  Time-lapse mining from internet photos , 2015, ACM Trans. Graph..

[14]  Martial Hebert,et al.  Patch to the Future: Unsupervised Visual Prediction , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Frédo Durand,et al.  Data-driven hallucination of different times of day from a single outdoor photo , 2013, ACM Trans. Graph..

[16]  Tamara L. Berg,et al.  Temporal Perception and Prediction in Ego-Centric Video , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Antonio Torralba,et al.  A Data-Driven Approach for Event Prediction , 2010, ECCV.

[18]  Thomas Brox,et al.  Learning to generate chairs with convolutional neural networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Alex Graves,et al.  DRAW: A Recurrent Neural Network For Image Generation , 2015, ICML.

[20]  Renjie Liao,et al.  Learning to generate images with perceptual similarity metrics , 2015, 2017 IEEE International Conference on Image Processing (ICIP).

[21]  James Hays,et al.  SUN attribute database: Discovering, annotating, and recognizing scene attributes , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[23]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[24]  Geoffrey E. Hinton,et al.  Learning and relearning in Boltzmann machines , 1986 .

[25]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[26]  Paul Smolensky,et al.  Information processing in dynamical systems: foundations of harmony theory , 1986 .

[27]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[28]  Martial Hebert,et al.  Activity Forecasting , 2012, ECCV.

[29]  Kristen Grauman,et al.  Relative attributes , 2011, 2011 International Conference on Computer Vision.

[30]  Honglak Lee,et al.  Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[31]  Edward H. Adelson,et al.  Discovering states and transformations in image collections , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[33]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.