Wide and Narrow: Video Prediction from Context and Motion

Video prediction, the task of forecasting future frames from a sequence of input frames, is challenging because changes in the view are influenced by several factors, such as the global context surrounding the scene and local motion dynamics. In this paper, we propose a new framework that integrates these complementary attributes to predict complex pixel dynamics with deep networks. To capture the local motion patterns of objects, we devise local filter memory networks that generate adaptive filter kernels by storing the prototypical motions of moving objects in memory. We further present global context propagation networks that iteratively aggregate non-local neighboring representations to preserve contextual information over the past frames. By utilizing the outputs of both networks, the proposed framework can address blurry predictions and color distortion. We conduct experiments on the Caltech Pedestrian and UCF101 datasets and demonstrate state-of-the-art results. In particular, for multi-step prediction, our method performs strongly in both quantitative and qualitative evaluations.
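
The idea of generating adaptive filter kernels from a memory of prototypical motions can be illustrated with a small sketch. This is not the paper's implementation: the attention-style read (softmax over query-key dot products), the 3x3 kernel size, the zero padding, and all names (`read_kernel`, `apply_local_filters`, `keys`, `kernels`, `queries`) are assumptions made for illustration only. Each memory slot pairs a key with a flattened kernel; every pixel blends the stored kernels according to how well its query feature matches each key, then filters its own neighborhood with the result.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_kernel(query, keys, kernels):
    """Blend the stored kernels by query-key similarity (assumed read rule)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    size = len(kernels[0])
    return [sum(w * kern[i] for w, kern in zip(weights, kernels))
            for i in range(size)]


def apply_local_filters(frame, queries, keys, kernels, ksize=3):
    """Filter each pixel's neighborhood with its own memory-read kernel."""
    height, width = len(frame), len(frame[0])
    radius = ksize // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            kernel = read_kernel(queries[y][x], keys, kernels)
            acc = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    # Pixels outside the frame contribute zero (zero padding).
                    if 0 <= yy < height and 0 <= xx < width:
                        idx = (dy + radius) * ksize + (dx + radius)
                        acc += kernel[idx] * frame[yy][xx]
            out[y][x] = acc
    return out


# Two hypothetical memory slots: "stay in place" and "move one pixel right".
keys = [[10.0, 0.0], [0.0, 10.0]]
identity = [0, 0, 0, 0, 1, 0, 0, 0, 0]     # copy the center pixel
shift_right = [0, 0, 0, 1, 0, 0, 0, 0, 0]  # copy the left neighbor
kernels = [identity, shift_right]

frame = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
# Every pixel's query matches the second key, so the content moves right.
queries = [[[0.0, 1.0] for _ in row] for row in frame]
predicted = apply_local_filters(frame, queries, keys, kernels)
```

In this toy setup, a query that matches the second key makes each output pixel take (approximately) the value of its left neighbor, so the frame content shifts one pixel to the right. In a learned model, the keys, kernels, and queries would come from trained networks rather than being set by hand.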
