Predicting Scene Parsing and Motion Dynamics in the Future

The ability of predicting the future is important for intelligent systems, e.g. autonomous vehicles and robots to plan early and make decisions accordingly. Future scene parsing and optical flow estimation are two key tasks that help agents better understand their environments as the former provides dense semantic information, i.e. what objects will be present and where they will appear, while the latter provides dense motion information, i.e. how the objects will move. In this paper, we propose a novel model to simultaneously predict scene parsing and optical flow in unobserved future video frames. To our best knowledge, this is the first attempt in jointly predicting scene parsing and motion dynamics. In particular, scene parsing enables structured motion prediction by decomposing optical flow into different groups while optical flow estimation brings reliable pixel-wise correspondence to scene parsing. By exploiting this mutually beneficial relationship, our model shows significantly better parsing and motion prediction results when compared to well-established baselines and individual prediction models on the large-scale Cityscapes dataset. In addition, we also demonstrate that our model can be used to predict the steering angle of the vehicles, which further verifies the ability of our model to learn latent representations of scene dynamics.

[1]  Martial Hebert,et al.  Patch to the Future: Unsupervised Visual Prediction , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Li Fei-Fei,et al.  Unsupervised Learning of Long-Term Motion Dynamics for Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Cordelia Schmid,et al.  EpicFlow: Edge-preserving interpolation of correspondences for optical flow , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[7]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[9]  Scott Cohen,et al.  Forecasting Human Dynamics from Static Images , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Antonio Torralba,et al.  A Data-Driven Approach for Event Prediction , 2010, ECCV.

[11]  Massimiliano Pontil,et al.  Multi-Task Feature Learning , 2006, NIPS.

[12]  Yann LeCun,et al.  Predicting Deeper into the Future of Semantic Segmentation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13]  Fernando De la Torre,et al.  Max-Margin Early Event Detectors , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  David F. Fouhey,et al.  Predicting Object Dynamics in Scenes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Michael J. Black,et al.  Optical Flow with Semantic Segmentation and Localized Layers , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Eder Santana,et al.  Learning a Driving Simulator , 2016, ArXiv.

[18]  José García Rodríguez,et al.  A Review on Deep Learning Techniques Applied to Semantic Segmentation , 2017, ArXiv.

[19]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[20]  Ruben Villegas,et al.  Learning to Generate Long-term Future via Hierarchical Prediction , 2017, ICML.

[21]  Raquel Urtasun,et al.  Fully Connected Deep Structured Networks , 2015, ArXiv.

[22]  Shuicheng Yan,et al.  Multi-Path Feedback Recurrent Neural Networks for Scene Parsing , 2016, AAAI.

[23]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[24]  Jian Dong,et al.  Video Scene Parsing with Predictive Feature Learning , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[25]  Yann LeCun,et al.  Deep multi-scale video prediction beyond mean square error , 2015, ICLR.

[26]  Seunghoon Hong,et al.  Decomposing Motion and Content for Natural Video Sequence Prediction , 2017, ICLR.

[27]  Martial Hebert,et al.  Dense Optical Flow Prediction from a Static Image , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Richard Szeliski,et al.  A Database and Evaluation Methodology for Optical Flow , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[29]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[30]  Thomas Brox,et al.  FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[31]  Rudolf Mester Motion estimation revisited: an estimation-theoretic approach , 2014, 2014 Southwest Symposium on Image Analysis and Interpretation.

[32]  Viorica Patraucean,et al.  Spatio-temporal video autoencoder with differentiable memory , 2015, ArXiv.

[33]  Massimiliano Pontil,et al.  Regularized multi--task learning , 2004, KDD.

[34]  Gabriel Kreiman,et al.  Unsupervised Learning of Visual Structure using Predictive Generative Networks , 2015, ArXiv.

[35]  Alexei A. Efros,et al.  Context Encoders: Feature Learning by Inpainting , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Sinisa Todorovic,et al.  Scene Labeling Using Beam Search under Mutex Constraints , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.