Predicting Video Frames Using Feature Based Locally Guided Objectives

This paper presents a feature reconstruction-based approach using Generative Adversarial Networks (GANs) to predict future frames in natural video scenes. Recent GAN-based methods often produce blurry outputs and degrade sharply in long-range prediction. Our method incorporates an intermediate feature-generating GAN to minimize the disparity between the ground truth and the predicted output. To this end, we propose two novel objective functions: (a) Locally Guided Gram Loss (LGGL) and (b) Multi-Scale Correlation Loss (MSCL), which further enhance the quality of the predicted frames. LGGL aids the feature-generating GAN in maximizing the similarity between the intermediate features of the ground truth and the network output by constructing Gram matrices from locally extracted patches at several levels of the generator. MSCL introduces a correlation-based objective at the frame-generating stage to model the temporal relationships between the predicted and ground-truth frames. The proposed model is end-to-end trainable and exhibits superior performance compared to the state-of-the-art on four real-world benchmark video datasets.
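The two objectives can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the function names, the patch size, the naive subsampling pyramid, and the `1 - correlation` penalty are all illustrative assumptions.

```python
import numpy as np

def local_gram_loss(feat_pred, feat_true, patch=4):
    """Sketch of Locally Guided Gram Loss (LGGL) at one generator level.

    feat_pred, feat_true: (C, H, W) feature maps.
    patch: side length of the local patches (an assumed hyperparameter).
    """
    C, H, W = feat_pred.shape
    loss, count = 0.0, 0
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            # Flatten each local patch to (C, patch*patch) and build its Gram matrix,
            # capturing local feature correlations rather than a single global Gram.
            a = feat_pred[:, y:y + patch, x:x + patch].reshape(C, -1)
            b = feat_true[:, y:y + patch, x:x + patch].reshape(C, -1)
            g_pred, g_true = a @ a.T, b @ b.T
            loss += np.mean((g_pred - g_true) ** 2)
            count += 1
    return loss / count

def multiscale_corr_loss(pred, true, scales=(1, 2)):
    """Sketch of Multi-Scale Correlation Loss (MSCL) on a pair of frames.

    Penalizes low normalized correlation between predicted and ground-truth
    frames at several spatial scales (subsampling stands in for a pyramid).
    """
    loss = 0.0
    for s in scales:
        p = pred[::s, ::s].astype(float)
        t = true[::s, ::s].astype(float)
        # Zero-mean, unit-variance normalization so the product is a correlation.
        p = (p - p.mean()) / (p.std() + 1e-8)
        t = (t - t.mean()) / (t.std() + 1e-8)
        corr = np.mean(p * t)      # normalized cross-correlation, in [-1, 1]
        loss += 1.0 - corr         # perfectly correlated frames incur ~0 loss
    return loss / len(scales)
```

Identical inputs drive both sketched losses to (near) zero, which is the intended behavior of reconstruction-style objectives: the Gram term matches local feature statistics, while the correlation term rewards temporal consistency of the generated frame with the ground truth.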
