Generating Future Frames with Mask-Guided Prediction

Most current approaches to video prediction either hallucinate future frames directly or learn a global motion transformation over the entire scene. Without an instance-aware mechanism, however, such methods struggle to model the underlying structures, dynamics, and appearances of foreground and background elements simultaneously, especially for long-term prediction. In this paper, we propose an explicit instance-level prediction approach to tackle this issue and present a novel mask-guided dual network. We use instance masks to extract active objects from the videos, and design two LSTM branches that predict the future dynamics and appearances of objects and backgrounds separately. Unlike recent skeleton-aided methods, which focus on a single human subject and require a two-stage procedure, our proposed network can predict instances of other categories and is trained end-to-end with a joint loss. We evaluate our approach on the KTH, Penn Action, and Running Horse datasets and achieve promising results both qualitatively and quantitatively.
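The core idea above — decompose each frame into object and background layers with instance masks, predict each layer with its own recurrent branch, then recompose — can be illustrated with a minimal sketch. This is not the paper's implementation: the real branches are LSTMs trained with a joint loss, whereas here a hypothetical `SimpleRecurrentPredictor` stands in for each branch using naive constant-velocity extrapolation, purely to show the dual-branch data flow.

```python
import numpy as np

def split_by_mask(frame, mask):
    """Decompose a frame into foreground and background layers.
    `mask` is a binary instance mask (1 = object pixel)."""
    fg = frame * mask
    bg = frame * (1 - mask)
    return fg, bg

class SimpleRecurrentPredictor:
    """Stand-in for one LSTM branch: keeps the previous input as state
    and extrapolates the next frame with a constant-velocity step.
    (Hypothetical placeholder, not the paper's learned dynamics.)"""
    def __init__(self):
        self.prev = None

    def step(self, x):
        if self.prev is None:
            pred = x.copy()          # first step: no motion history yet
        else:
            pred = x + (x - self.prev)  # linear extrapolation of the layer
        self.prev = x
        return pred

def predict_next_frame(frames, masks):
    """Run both branches over the observed sequence and recompose
    the predicted object and background layers into one frame."""
    fg_branch = SimpleRecurrentPredictor()
    bg_branch = SimpleRecurrentPredictor()
    fg_pred = bg_pred = None
    for frame, mask in zip(frames, masks):
        fg, bg = split_by_mask(frame, mask)
        fg_pred = fg_branch.step(fg)
        bg_pred = bg_branch.step(bg)
    return fg_pred + bg_pred  # masked regions come from the object branch
```

In the actual network the instance masks would come from an off-the-shelf segmenter such as Mask R-CNN [9], and each branch would be a trained (Conv)LSTM rather than this extrapolation stub; the sketch only conveys how mask-guided decomposition lets the two branches specialize.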
