Paper information - SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation

In this paper we introduce a Transformer-based approach to video object segmentation (VOS). To address the compounding error and scalability issues of prior work, we propose a scalable, end-to-end method for VOS called Sparse Spatiotemporal Transformers (SST). SST extracts per-pixel representations for each object in a video using sparse attention over spatiotemporal features. Our attention-based formulation for VOS allows a model to learn to attend over a history of multiple frames and provides a suitable inductive bias for the correspondence-like computations needed to solve motion segmentation. We demonstrate the effectiveness of attention-based networks over recurrent networks in the spatiotemporal domain. Our method achieves competitive results on YouTube-VOS and DAVIS 2017, with improved scalability and robustness to occlusions compared with the state of the art. Code is available at https://github.com/dukebw/SSTVOS.
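To make the idea of sparse attention over spatiotemporal features more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `local_spatiotemporal_attention`, the `radius` parameter, and the choice of a local square window as the sparsity pattern are all assumptions made for illustration, and the keys double as values with no learned projections. Each pixel of the current frame attends only to nearby pixels in each past frame, rather than to every pixel in the history.

```python
import math


def local_spatiotemporal_attention(query, memory, radius=1):
    """Illustrative sparse attention (hypothetical helper, not from the paper).

    query:  H x W x C nested lists, features of the current frame.
    memory: T x H x W x C nested lists, features of T past frames.
    Returns H x W x C nested lists. Each output pixel is a softmax-weighted
    average of memory features inside a (2*radius+1)^2 window around the
    same location in every past frame.
    """
    height, width, channels = len(query), len(query[0]), len(query[0][0])
    scale = 1.0 / math.sqrt(channels)  # standard dot-product scaling
    out = [[None] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            q = query[y][x]
            # Gather the sparse key set: a local window in each past frame.
            keys = []
            for frame in memory:
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        keys.append(frame[ny][nx])

            # Scaled dot-product scores, then a numerically stable softmax.
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)

            # Keys are reused as values in this simplified sketch.
            out[y][x] = [
                sum(w * k[c] for w, k in zip(weights, keys)) / total
                for c in range(channels)
            ]
    return out
```

With a window of radius r, each query compares against at most T(2r+1)^2 keys instead of all T·H·W positions, which is the kind of saving that motivates sparse attention in this setting. The sparsity pattern actually used by SST may differ from this local window.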

Parham Aarabi | Christian Wolf | Graham W. Taylor | Brendan Duke | Abdalla Ahmed
