Paper Information - Local Memory Attention for Fast Video Semantic Segmentation

Local Memory Attention for Fast Video Semantic Segmentation

We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline. In contrast to prior work, we strive towards a simple, fast, and general module that can be integrated into virtually any single-frame architecture. Our approach aggregates a rich representation of the semantic information in past frames into a memory module. Information stored in the memory is then accessed through an attention mechanism. In contrast to previous memory-based approaches, we propose a fast local attention layer, providing temporal appearance cues in the local region of prior frames. We further fuse these cues with an encoding of the current frame through a second attention-based module. The segmentation decoder processes the fused representation to predict the final semantic segmentation. We integrate our approach into two popular semantic segmentation networks: ERFNet and PSPNet. We observe an improvement in segmentation performance on Cityscapes of 1.7% and 2.1% mIoU, respectively, while increasing the inference time of ERFNet by only 1.5 ms. Source code is available at https://github.com/mattpfr/lmanet.
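The local attention read described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes that each pixel of the current frame attends only to a square window of the same location in each stored frame, with a single softmax over all windows. The function name `local_memory_attention`, the `radius` parameter, and the key/value layout are hypothetical choices made for illustration, and plain nested lists are used so the example runs without any dependencies.

```python
import math


def local_memory_attention(query, mem_keys, mem_values, radius=1):
    """Read from a memory of past frames with local attention (illustrative).

    query:      H x W x C nested lists, encoding of the current frame.
    mem_keys:   list of T past-frame key maps, each H x W x C.
    mem_values: list of T past-frame value maps, each H x W x D.
    radius:     half-size of the square window searched around each pixel
                (hypothetical parameter name).

    Returns an H x W x D map. Each output vector is a softmax-weighted
    average of the memory values inside the local windows of all T frames.
    """
    height, width = len(query), len(query[0])
    value_dim = len(mem_values[0][0][0])
    output = [[None] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            q = query[y][x]
            scores, candidates = [], []
            # Gather similarity scores only from the local neighbourhood
            # of (y, x) in every stored frame, clipped at the borders.
            for keys, values in zip(mem_keys, mem_values):
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        scores.append(sum(a * b for a, b in zip(q, keys[ny][nx])))
                        candidates.append(values[ny][nx])
            # Numerically stable softmax over all gathered positions.
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            output[y][x] = [
                sum(w * v[d] for w, v in zip(weights, candidates)) / total
                for d in range(value_dim)
            ]
    return output


if __name__ == "__main__":
    h, w = 3, 4
    current = [[[1.0, 0.5] for _ in range(w)] for _ in range(h)]
    past_keys = [[[float(y), float(x)] for x in range(w)] for y in range(h)]
    past_vals = [[[float(y * w + x)] for x in range(w)] for y in range(h)]
    read = local_memory_attention(current, [past_keys], [past_vals], radius=1)
    print(len(read), len(read[0]), len(read[0][0]))
```

Under these assumptions the cost per pixel grows with the window size and the number of stored frames rather than with the full image area, which is the property that makes a local read cheaper than global attention over the memory. In a full pipeline the read-out would then be fused with the current-frame encoding before decoding, a step this sketch leaves out.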
