Learning Deep Spatio-Temporal Dependence for Semantic Video Segmentation

Semantically labeling every pixel in a video is a very challenging task as video is an information-intensive media with complex spatio-temporal dependence. We present in this paper a novel deep convolutional network architecture, called deep spatio-temporal fully convolutional networks (DST-FCN), which leverages both spatial and temporal dependencies among pixels and voxels by training them in an end-to-end manner. Specifically, we introduce a two-stream network by learning the deep spatio-temporal dependence, in which a 2D FCN followed by the convolutional long short-term memory (ConvLSTM) is employed on the pixel level and a 3-D FCN is exploited on the voxel level. Our model differs from conventional FCN in that it not only extends FCN by adding ConvLSTM on the pixel level for exploring long-term dependence, but also proposes 3-D FCN to enable voxel level prediction. On two benchmarks of A2D and CamVid, our DST-FCN achieves superior results to state-of-the-art techniques. More remarkably, we obtain to-date the best reported results: 45.0% per-label accuracy on A2D and 68.8% mean IoU on CamVid.

[1]  Tao Mei,et al.  Deep Quantization: Encoding Convolutional Activations with Deep Generative Model , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Hujun Bao,et al.  Spatio-Temporal Video Segmentation of Static Scenes and Its Applications , 2015, IEEE Transactions on Multimedia.

[3]  Roberto Cipolla,et al.  Segmentation and Recognition Using Structure from Motion Point Clouds , 2008, ECCV.

[4]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[5]  Tao Mei,et al.  Learning hierarchical video representation for action recognition , 2017, International Journal of Multimedia Information Retrieval.

[6]  Xiaoxiao Li,et al.  Semantic Image Segmentation via Deep Parsing Network , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[8]  Tao Mei,et al.  Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9]  Roberto Cipolla,et al.  MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving , 2016, 2018 IEEE Intelligent Vehicles Symposium (IV).

[10]  Ronan Collobert,et al.  Recurrent Convolutional Neural Networks for Scene Labeling , 2014, ICML.

[11]  Philip H. S. Torr,et al.  Combining Appearance and Structure from Motion Features for Road Scene Understanding , 2009, BMVC.

[12]  한보형,et al.  Learning Deconvolution Network for Semantic Segmentation , 2015 .

[13]  Svetlana Lazebnik,et al.  Superparsing - Scalable Nonparametric Image Parsing with Superpixels , 2010, International Journal of Computer Vision.

[14]  Irfan A. Essa,et al.  Geometric Context from Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[16]  Charless C. Fowlkes,et al.  Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation , 2016, ECCV.

[17]  Xuming He,et al.  Multiclass semantic video segmentation with object-level active inference , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[19]  Jason J. Corso,et al.  Temporally consistent multi-class video-object segmentation with the Video Graph-Shifts algorithm , 2011, 2011 IEEE Workshop on Applications of Computer Vision (WACV).

[20]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Xiao-Ping Zhang,et al.  Efficient Heuristic Methods for Multimodal Fusion and Concept Fusion in Video Concept Detection , 2015, IEEE Transactions on Multimedia.

[22]  Charless C. Fowlkes,et al.  Laplacian Reconstruction and Refinement for Semantic Segmentation , 2016, ArXiv.

[23]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Xuming He,et al.  Learning Dynamic Hierarchical Models for Anytime Scene Labeling , 2016, ECCV.

[25]  Dit-Yan Yeung,et al.  Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[26]  Tao Mei,et al.  Super Fast Event Recognition in Internet Videos , 2015, IEEE Transactions on Multimedia.

[27]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Guosheng Lin,et al.  Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Roberto Cipolla,et al.  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Nicu Sebe,et al.  Joint Graph Learning and Video Segmentation via Multiple Cues and Topology Calibration , 2016, ACM Multimedia.

[31]  Tao Mei,et al.  Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation , 2016, ICMR.

[32]  Vladlen Koltun,et al.  Feature Space Optimization for Semantic Video Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Chenliang Xu,et al.  Streaming Hierarchical Video Segmentation , 2012, ECCV.

[34]  Mei Han,et al.  Efficient hierarchical graph-based video segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[35]  Chenliang Xu,et al.  Actor-Action Semantic Segmentation with Grouping Process Models , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Chong-Wah Ngo,et al.  Video Event Detection Using Motion Relativity and Feature Selection , 2014, IEEE Transactions on Multimedia.

[37]  Xiaochun Cao,et al.  Fashion Parsing With Video Context , 2015, IEEE Trans. Multim..

[38]  Chenliang Xu,et al.  Can humans fly? Action understanding with multiple classes of actors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Ruigang Yang,et al.  Semantic Segmentation of Urban Scenes Using Dense Depth Maps , 2010, ECCV.

[40]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  C. V. Jawahar,et al.  Scene Text Recognition using Higher Order Language Priors , 2009, BMVC.

[42]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43]  Wei Liu,et al.  ParseNet: Looking Wider to See Better , 2015, ArXiv.