Paper information - SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

In this paper we introduce a Transformer-based approach to video object segmentation (VOS). To address the compounding error and scalability issues of prior work, we propose a scalable, end-to-end method for VOS called Sparse Spatiotemporal Transformers (SST). SST extracts per-pixel representations for each object in a video using sparse attention over spatiotemporal features. Our attention-based formulation for VOS allows a model to learn to attend over a history of multiple frames and provides a suitable inductive bias for performing the correspondence-like computations necessary for solving motion segmentation. We demonstrate the effectiveness of attention-based networks over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017 with improved scalability and robustness to occlusions compared with the state of the art.
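
To make the idea of sparse attention over a history of frames concrete, here is a minimal sketch in plain Python. It is not the SST operator from the paper: the local-window sparsity pattern, the function name `sparse_spatiotemporal_attention`, the `radius` parameter, and the nested-list feature layout are all assumptions chosen for illustration. A query pixel in the current frame attends only to nearby pixels in each earlier frame, rather than to every pixel of every frame.

```python
import math


def sparse_spatiotemporal_attention(feats, t, y, x, radius=1):
    """Attend from pixel (y, x) of frame t over a local window in each earlier frame.

    feats[s][v][u] is a list of C floats (hypothetical layout, for illustration).
    Returns (output vector, {(s, v, u): attention weight}).
    """
    query = feats[t][y][x]
    scale = 1.0 / math.sqrt(len(query))
    height, width = len(feats[0]), len(feats[0][0])

    # Sparsity (an assumed pattern): only pixels within `radius` of (y, x)
    # in each past frame act as keys, instead of all H * W pixels per frame.
    positions = [
        (s, v, u)
        for s in range(t)
        for v in range(max(0, y - radius), min(height, y + radius + 1))
        for u in range(max(0, x - radius), min(width, x + radius + 1))
    ]
    if not positions:
        raise ValueError("frame t has no history to attend over")

    # Scaled dot-product scores followed by a numerically stable softmax.
    scores = [
        scale * sum(q * k for q, k in zip(query, feats[s][v][u]))
        for s, v, u in positions
    ]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Keys double as values here to keep the sketch short.
    output = [
        sum(w * feats[s][v][u][c] for w, (s, v, u) in zip(weights, positions))
        for c in range(len(query))
    ]
    return output, dict(zip(positions, weights))


# Toy video: 3 frames of 5x5 pixels with 2 channels. One distinctive
# feature vector moves right by one pixel per frame along row 2.
video = [[[[0.0, 0.0] for _ in range(5)] for _ in range(5)] for _ in range(3)]
for frame in range(3):
    video[frame][2][1 + frame] = [4.0, 4.0]

_, attn = sparse_spatiotemporal_attention(video, t=2, y=2, x=3, radius=1)
best_match = max(attn, key=attn.get)  # position in history most similar to the query
```

In this toy setup the query attends over 18 history positions (a 3x3 window in each of two earlier frames) instead of the 50 a dense formulation would visit, and the highest weight falls on the object's location in the previous frame. This is the correspondence-like behaviour the abstract refers to, shown here only under the assumed sparsity pattern.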
