Paper Information - Semantic Video CNNs Through Representation Warping

Semantic Video CNNs Through Representation Warping

In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very little extra computational cost. This module is called NetWarp and we demonstrate its use for a range of network architectures. The main design principle is to use optical flow of adjacent frames for warping internal network representations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to-end training. Experiments validate that the proposed approach incurs only little extra computational cost, while improving performance, when video streams are available. We achieve new state-of-the-art results on the CamVid and Cityscapes benchmark datasets and show consistent improvements over different baseline networks. Our code and models are available at http://segmentation.is.tue.mpg.de
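
The core idea stated in the abstract, warping an internal representation of the previous frame with optical flow and using it alongside the current frame's representation, can be illustrated with a minimal sketch. This is not the authors' NetWarp implementation. It assumes a single-channel feature map stored as nested lists, a flow field that points from each current-frame position to its match in the previous frame, bilinear interpolation with border clamping, and a fixed-weight sum for merging. The names `bilinear_sample`, `warp`, `combine`, `w_cur`, and `w_prev` are hypothetical, and the abstract does not specify how the warped and current representations are merged.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a 2D grid (list of rows) at a real-valued (x, y), clamping to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom


def warp(prev_feat, flow):
    """Warp previous-frame features onto the current frame's pixel grid.

    Assumption: flow[y][x] = (dx, dy) points from position (x, y) in the
    current frame to the matching position in the previous frame.
    """
    h, w = len(prev_feat), len(prev_feat[0])
    return [
        [bilinear_sample(prev_feat, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]


def combine(cur_feat, warped_prev, w_cur=0.5, w_prev=0.5):
    """Merge current and warped features with fixed weights (placeholder choice)."""
    return [
        [w_cur * c + w_prev * p for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(cur_feat, warped_prev)
    ]


# Toy example: the scene content shifts one pixel to the left between frames,
# so each current position (x, y) corresponds to (x + 1, y) in the previous frame.
prev_feat = [[0.0, 1.0, 2.0],
             [3.0, 4.0, 5.0],
             [6.0, 7.0, 8.0]]
flow = [[(1.0, 0.0)] * 3 for _ in range(3)]
cur_feat = [[1.0, 2.0, 2.0],
            [4.0, 5.0, 5.0],
            [7.0, 8.0, 8.0]]

warped = warp(prev_feat, flow)
fused = combine(cur_feat, warped)
print(warped)
print(fused)
```

In this toy case the warped previous-frame features line up with the current-frame features, so the fused map equals the current one. In a real network the same operation would be applied per channel to an intermediate layer's activations, with the flow resized to that layer's spatial resolution.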
