Associating Objects with Transformers for Video Object Segmentation

This paper investigates how to realize better and more efficient embedding learning for semi-supervised video object segmentation in challenging multi-object scenarios. State-of-the-art methods learn to decode features with a single positive object and therefore have to match and segment each target separately in multi-object scenarios, consuming several times the computing resources. To solve this problem, we propose the Associating Objects with Transformers (AOT) approach, which matches and decodes multiple objects uniformly. In detail, AOT employs an identification mechanism to associate multiple targets into the same high-dimensional embedding space. Thus, we can process the matching and segmentation decoding of multiple objects simultaneously, as efficiently as processing a single object. To sufficiently model multi-object association, a Long Short-Term Transformer is designed for constructing hierarchical matching and propagation. We conduct extensive experiments on both multi-object and single-object benchmarks to examine AOT variant networks of different complexities. In particular, our R50-AOT-L outperforms all state-of-the-art competitors on three popular benchmarks, i.e., YouTube-VOS (84.1% J&F), DAVIS 2017 (84.9%), and DAVIS 2016 (91.1%), while running more than 3× faster in multi-object scenarios. Meanwhile, our AOT-T maintains real-time multi-object speed on the above benchmarks. Based on AOT, we ranked 1st in the 3rd Large-scale VOS Challenge.
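
As a rough illustration of the identification idea described above, the sketch below maps the masks of several objects into one shared embedding tensor by giving each object its own vector from a learnable "identity bank". It is not the authors' implementation: the bank, its size, the random assignment of identities, and the function name `embed_identities` are all assumptions made for illustration, and the abstract does not specify these details.

```python
import numpy as np


def embed_identities(masks, id_bank, assignment):
    """Fuse N per-object masks into a single (H, W, C) embedding.

    masks:      (N, H, W) array, 1.0 where the object is present, else 0.0
    id_bank:    (M, C) array of identity vectors (hypothetical, M >= N)
    assignment: length-N integer array, the bank row used for each object
    """
    masks = np.asarray(masks, dtype=np.float32)
    chosen = id_bank[assignment]  # (N, C): one identity vector per object
    # Each pixel receives the identity vector of the object covering it,
    # so all objects live in the same C-dimensional space.
    return np.einsum("nhw,nc->hwc", masks, chosen)


rng = np.random.default_rng(0)
num_ids, channels = 10, 8  # assumed sizes, chosen only for the example
id_bank = rng.standard_normal((num_ids, channels)).astype(np.float32)

# Two toy objects on a 4x4 frame, occupying opposite corners.
masks = np.zeros((2, 4, 4), dtype=np.float32)
masks[0, :2, :2] = 1.0
masks[1, 2:, 2:] = 1.0

# Assumed here: identities are picked at random without repetition.
assignment = rng.permutation(num_ids)[:2]
id_embedding = embed_identities(masks, id_bank, assignment)
```

Because every object is encoded in the same tensor, a single matching and decoding pass over `id_embedding` can, in principle, serve all objects at once, which is the efficiency argument made in the abstract.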
