论文信息 - Learning Deep Spatio-Temporal Dependence for Semantic Video Segmentation

Learning Deep Spatio-Temporal Dependence for Semantic Video Segmentation

Semantically labeling every pixel in a video is a very challenging task as video is an information-intensive media with complex spatio-temporal dependence. We present in this paper a novel deep convolutional network architecture, called deep spatio-temporal fully convolutional networks (DST-FCN), which leverages both spatial and temporal dependencies among pixels and voxels by training them in an end-to-end manner. Specifically, we introduce a two-stream network by learning the deep spatio-temporal dependence, in which a 2D FCN followed by the convolutional long short-term memory (ConvLSTM) is employed on the pixel level and a 3-D FCN is exploited on the voxel level. Our model differs from conventional FCN in that it not only extends FCN by adding ConvLSTM on the pixel level for exploring long-term dependence, but also proposes 3-D FCN to enable voxel level prediction. On two benchmarks of A2D and CamVid, our DST-FCN achieves superior results to state-of-the-art techniques. More remarkably, we obtain to-date the best reported results: 45.0% per-label accuracy on A2D and 68.8% mean IoU on CamVid.

[1] Tao Mei,et al. Deep Quantization: Encoding Convolutional Activations with Deep Generative Model , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Hujun Bao,et al. Spatio-Temporal Video Segmentation of Static Scenes and Its Applications , 2015, IEEE Transactions on Multimedia.

[3] Roberto Cipolla,et al. Segmentation and Recognition Using Structure from Motion Point Clouds , 2008, ECCV.

[4] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[5] Tao Mei,et al. Learning hierarchical video representation for action recognition , 2017, International Journal of Multimedia Information Retrieval.

[6] Xiaoxiao Li,et al. Semantic Image Segmentation via Deep Parsing Network , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7] Vladlen Koltun,et al. Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[8] Tao Mei,et al. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9] Roberto Cipolla,et al. MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving , 2016, 2018 IEEE Intelligent Vehicles Symposium (IV).

[10] Ronan Collobert,et al. Recurrent Convolutional Neural Networks for Scene Labeling , 2014, ICML.

[11] Philip H. S. Torr,et al. Combining Appearance and Structure from Motion Features for Road Scene Understanding , 2009, BMVC.

[12] 한보형,et al. Learning Deconvolution Network for Semantic Segmentation , 2015 .

[13] Svetlana Lazebnik,et al. Superparsing - Scalable Nonparametric Image Parsing with Superpixels , 2010, International Journal of Computer Vision.

[14] Irfan A. Essa,et al. Geometric Context from Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[15] Iasonas Kokkinos,et al. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[16] Charless C. Fowlkes,et al. Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation , 2016, ECCV.

[17] Xuming He,et al. Multiclass semantic video segmentation with object-level active inference , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[19] Jason J. Corso,et al. Temporally consistent multi-class video-object segmentation with the Video Graph-Shifts algorithm , 2011, 2011 IEEE Workshop on Applications of Computer Vision (WACV).

[20] Vibhav Vineet,et al. Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21] Xiao-Ping Zhang,et al. Efficient Heuristic Methods for Multimodal Fusion and Concept Fusion in Video Concept Detection , 2015, IEEE Transactions on Multimedia.

[22] Charless C. Fowlkes,et al. Laplacian Reconstruction and Refinement for Semantic Segmentation , 2016, ArXiv.

[23] Fei-Fei Li,et al. Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24] Xuming He,et al. Learning Dynamic Hierarchical Models for Anytime Scene Labeling , 2016, ECCV.

[25] Dit-Yan Yeung,et al. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[26] Tao Mei,et al. Super Fast Event Recognition in Internet Videos , 2015, IEEE Transactions on Multimedia.

[27] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[28] Guosheng Lin,et al. Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Roberto Cipolla,et al. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30] Nicu Sebe,et al. Joint Graph Learning and Video Segmentation via Multiple Cues and Topology Calibration , 2016, ACM Multimedia.

[31] Tao Mei,et al. Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation , 2016, ICMR.

[32] Vladlen Koltun,et al. Feature Space Optimization for Semantic Video Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Chenliang Xu,et al. Streaming Hierarchical Video Segmentation , 2012, ECCV.

[34] Mei Han,et al. Efficient hierarchical graph-based video segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[35] Chenliang Xu,et al. Actor-Action Semantic Segmentation with Grouping Process Models , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Chong-Wah Ngo,et al. Video Event Detection Using Motion Relativity and Feature Selection , 2014, IEEE Transactions on Multimedia.

[37] Xiaochun Cao,et al. Fashion Parsing With Video Context , 2015, IEEE Trans. Multim..

[38] Chenliang Xu,et al. Can humans fly? Action understanding with multiple classes of actors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Ruigang Yang,et al. Semantic Segmentation of Urban Scenes Using Dense Depth Maps , 2010, ECCV.

[40] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41] C. V. Jawahar,et al. Scene Text Recognition using Higher Order Language Priors , 2009, BMVC.

[42] Camille Couprie,et al. Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43] Wei Liu,et al. ParseNet: Looking Wider to See Better , 2015, ArXiv.