MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study the tracking and segmenting objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of MOSE dataset is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames. To analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4 different settings on the proposed MOSE dataset and conduct comprehensive comparisons. The experiments show that current VOS algorithms cannot well perceive objects in complex scenes. For example, under the semi-supervised VOS setting, the highest J&F by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ~90% J&F performance on DAVIS. The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges under complex scenes and more efforts are desired to explore these challenges in the future. The proposed MOSE dataset has been released at https://henghuiding.github.io/MOSE.

[1]  Jianping Shi,et al.  Improving Video Instance Segmentation via Temporal Pyramid Routing , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  David J. Crandall,et al.  A Survey on Deep Learning Technique for Video Segmentation , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Yi Yang,et al.  Decoupling Features in Hierarchical Propagation for Video Object Segmentation , 2022, Neural Information Processing Systems.

[4]  Chi-Keung Tang,et al.  Video Mask Transfiner for High-Quality Video Instance Segmentation , 2022, ECCV.

[5]  Thomas E. Huang,et al.  Tracking Every Thing in the Wild , 2022, ECCV.

[6]  Ho Kei Cheng,et al.  XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model , 2022, ECCV.

[7]  L. Shao,et al.  Multi-Level Representation Learning with Semantic Alignment for Referring Video Object Segmentation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Wenhao Jiang,et al.  SWEM: Towards Real-Time Video Object Segmentation with Sequential Weighted Expectation-Maximization , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Junshi Huang,et al.  Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation , 2022, Computer Vision and Pattern Recognition.

[10]  Yunchao Wei,et al.  Large-scale Video Panoptic Segmentation in the Wild: A Benchmark , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Zhiwei Xiong,et al.  Recurrent Dynamic Embedding for Video Object Segmentation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Xudong Jiang,et al.  Instance-Specific Feature Propagation for Referring Segmentation , 2022, IEEE Transactions on Multimedia.

[13]  L. Gool,et al.  Coarse-to-Fine Feature Mining for Video Semantic Segmentation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Xudong Jiang,et al.  Deep Interactive Image Matting With Feature Propagation , 2022, IEEE Transactions on Image Processing.

[15]  Evgenii Zheltonozhskii,et al.  End-to-End Referring Video Object Segmentation with Multimodal Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  David J. Crandall,et al.  Segmenting Objects From Relational Visual Data , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Si Liu,et al.  Cross-Modal Progressive Comprehension for Referring Segmentation , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Philip H. S. Torr,et al.  Occluded Video Instance Segmentation: A Benchmark , 2021, International Journal of Computer Vision.

[19]  Yang Wang,et al.  Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Jun Liu,et al.  Interaction via Bi-directional Graph of Semantic Region Affinity for Scene Parsing , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Henghui Ding,et al.  Prototypical Matching and Open Set Rejection for Zero-Shot Semantic Segmentation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Xudong Jiang,et al.  Vision-Language Transformer and Query Generation for Referring Segmentation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Roger Zimmermann,et al.  Recovering the Unbiased Scene Graphs from the Biased Ones , 2021, ACM Multimedia.

[24]  Chi-Keung Tang,et al.  Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation , 2021, NeurIPS.

[25]  Yi Yang,et al.  Associating Objects with Transformers for Video Object Segmentation , 2021, NeurIPS.

[26]  Wenguan Wang,et al.  Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation , 2021, ArXiv.

[27]  Guoqiang Han,et al.  Reciprocal Transformations for Unsupervised Video Object Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Stefano Soatto,et al.  DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping* , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  In So Kweon,et al.  Learning to Associate Every Segment for Video Panoptic Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Jiaxu Miao,et al.  VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Yongdong Zhang,et al.  Query-Memory Re-Aggregation for Weakly-supervised Video Object Segmentation , 2021, AAAI.

[32]  Yeong Jun Koh,et al.  Guided Interactive Video Object Segmentation Using Reliability-Based Attention Maps , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Rong Jin,et al.  Learning Position and Target Consistency for Memory-based Video Object Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Chi-Keung Tang,et al.  Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Shenghua Gao,et al.  Learning to Recommend Frame for Interactive Video Object Segmentation in the Wild , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Ho Kei Cheng,et al.  Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Parham Aarabi,et al.  SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Ganesh Venkatesh,et al.  Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Alan Yuille,et al.  ViP-DeepLab: Learning Visual Perception with Depth-aware Video Panoptic Segmentation , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Dongdong Yu,et al.  F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation , 2020, AAAI.

[41]  Song Bai,et al.  Occluded Video Instance Segmentation , 2021, ArXiv.

[42]  Sanja Fidler,et al.  ScribbleBox: Interactive Annotation Framework for Video Object Segmentation , 2020, ECCV.

[43]  Chang-Su Kim,et al.  Interactive Video Object Segmentation Using Global and Local Transfer Modules , 2020, ECCV.

[44]  Luc Van Gool,et al.  Video Object Segmentation with Episodic Graph Memory Networks , 2020, ECCV.

[45]  Bin Liu,et al.  GSM: Graph Similarity Model for Multi-Object Tracking , 2020, IJCAI.

[46]  Alan Yuille,et al.  Compositional Convolutional Neural Networks: A Robust and Interpretable Model for Object Recognition Under Occlusion , 2020, International Journal of Computer Vision.

[47]  In So Kweon,et al.  Video Panoptic Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Chi-Keung Tang,et al.  Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Mingjie Sun,et al.  Fast Template Matching and Update for Video Object Tracking and Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Stephen Lin,et al.  A Transductive Approach for Video Object Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Bo Dai,et al.  Self-Supervised Scene De-Occlusion , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Yunchao Wei,et al.  Memory Aggregation Networks for Efficient Interactive Video Object Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Laura Leal-Taixé,et al.  STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos , 2020, ECCV.

[54]  Yunchao Wei,et al.  Collaborative Video Object Segmentation by Foreground-Background Integration , 2020, ECCV.

[55]  Steven C. H. Hoi,et al.  Learning Video Object Segmentation From Unlabeled Videos , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Ling Shao,et al.  Motion-Attentive Transition for Zero-Shot Video Object Segmentation , 2020, AAAI.

[57]  Gang Yu,et al.  State-Aware Tracker for Real-Time Video Object Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Zhuowen Tu,et al.  Learning Instance Occlusion for Panoptic Segmentation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Gang Wang,et al.  Feature Boosting Network For 3D Pose Estimation , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  Bohyung Han,et al.  URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark , 2020, ECCV.

[61]  Xudong Jiang,et al.  PhraseClick: Toward Achieving Flexible Interactive Segmentation by Phrase and Click , 2020, ECCV.

[62]  Ling Shao,et al.  Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[63]  Xiaojuan Qi,et al.  AGSS-VOS: Attention Guided Single-Shot Video Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[64]  Cheng Deng,et al.  Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[65]  Zhe L. Lin,et al.  Fast Video Object Segmentation via Dynamic Targeting Network , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[66]  Luca Bertinetto,et al.  Anchor Diffusion for Unsupervised Video Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[67]  Mubarak Shah,et al.  CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[68]  Yizhou Yu,et al.  Motion Guided Attention for Video Salient Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[69]  Gang Wang,et al.  Boundary-Aware Feature Propagation for Scene Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[70]  Ling Shao,et al.  RANet: Ranking Attention Network for Fast Video Object Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[71]  Ling Shao,et al.  See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Xudong Jiang,et al.  Semantic Correlation Promoted Shape-Variant Context for Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Sanyuan Zhao,et al.  Learning Unsupervised Video Object Segmentation Through Visual Attention , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Yuchen Fan,et al.  Video Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[75]  Luc Van Gool,et al.  The 2019 DAVIS Challenge on VOS: Unsupervised Multi-Object Segmentation , 2019, ArXiv.

[76]  Yue Cao,et al.  Spatial-Temporal Relation Networks for Multi-Object Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[77]  Ning Xu,et al.  Fast User-Guided Video Object Segmentation by Interaction-And-Propagation Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[78]  Wei Liu,et al.  MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[79]  Ning Xu,et al.  Video Object Segmentation Using Space-Time Memory Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[80]  Miriam Bellver,et al.  RVOS: End-To-End Recurrent Network for Video Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[81]  Gang Wang,et al.  Toward Achieving Robust Low-Level and High-Level Scene Parsing , 2019, IEEE Transactions on Image Processing.

[82]  Bastian Leibe,et al.  FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[83]  Qiang Wang,et al.  Fast Online Object Tracking and Segmentation: A Unifying Approach , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[84]  Hua Yang,et al.  Online Multi-Object Tracking with Dual Matching Attention Networks , 2018, ECCV.

[85]  Ning Xu,et al.  YouTube-VOS: Sequence-to-Sequence Video Object Segmentation , 2018, ECCV.

[86]  Shifeng Zhang,et al.  Occlusion-aware R-CNN: Detecting Pedestrians in a Crowd , 2018, ECCV.

[87]  Guosheng Lin,et al.  MoNet: Deep Motion Exploitation for Video Object Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[88]  Kalyan Sunkavalli,et al.  Fast Video Object Segmentation by Reference-Guided Mask Propagation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[89]  Gang Wang,et al.  Context Contrasted Feature and Gated Multi-scale Aggregation for Scene Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[90]  Gang Wang,et al.  Motion-Guided Cascaded Refinement Network for Video Object Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[91]  Yuan Xie,et al.  Flow Guided Recurrent Neural Encoder for Video Salient Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[92]  Xiaojun Chang,et al.  Reinforcement Cutting-Agent Learning for Video Object Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[93]  Ming-Hsuan Yang,et al.  Fast and Accurate Online Video Object Segmentation via Tracking Parts , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[94]  Luc Van Gool,et al.  Blazingly Fast Video Object Segmentation with Pixel-Wise Metric Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[95]  Karteek Alahari,et al.  Learning to Segment Moving Objects , 2017, International Journal of Computer Vision.

[96]  Yuning Jiang,et al.  Repulsion Loss: Detecting Pedestrians in a Crowd , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[97]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[98]  Ming-Hsuan Yang,et al.  SegFlow: Joint Learning for Video Object Segmentation and Optical Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[99]  Jun-Sik Kim,et al.  Pixel-Level Matching for Video Object Segmentation Using Convolutional Neural Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[100]  Nenghai Yu,et al.  Online Multi-object Tracking Using CNN-Based Single Object Tracker with Spatial-Temporal Attention Mechanism , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[101]  Chang-Su Kim,et al.  Online Video Object Segmentation via Convolutional Trident Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[102]  Karteek Alahari,et al.  Learning Video Object Segmentation with Visual Memory , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[103]  Luc Van Gool,et al.  The 2017 DAVIS Challenge on Video Object Segmentation , 2017, ArXiv.

[104]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[105]  Peter V. Gehler,et al.  Video Propagation Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[106]  Bernt Schiele,et al.  Learning Video Object Segmentation from Static Images , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[107]  Luc Van Gool,et al.  One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[108]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[109]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[110]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[111]  Dani Lischinski,et al.  JumpCut , 2015, ACM Trans. Graph..

[112]  Jitendra Malik,et al.  Learning to segment moving objects in videos , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[113]  Trevor Darrell,et al.  Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[114]  Jitendra Malik,et al.  Ieee Transactions on Pattern Analysis and Machine Intelligence Segmentation of Moving Objects by Long Term Video Analysis , 2022 .

[115]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[116]  James M. Rehg,et al.  Video Segmentation by Tracking Many Figure-Ground Segments , 2013, 2013 IEEE International Conference on Computer Vision.

[117]  Cordelia Schmid,et al.  Learning object class detectors from weakly annotated video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[118]  Roberto Cipolla,et al.  Semantic object classes in video: A high-definition ground truth database , 2009, Pattern Recognit. Lett..

[119]  Jitendra Malik,et al.  Learning to detect natural image boundaries using local brightness, color, and texture cues , 2004, IEEE Transactions on Pattern Analysis and Machine Intelligence.