Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference. Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice. Such bottom-up strategy fails to explore object-level cues, easily leading to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranks 1 place on CVPR2021 Referring Youtube-VOS challenge.

[1]  Chongruo Wu,et al.  ResNeSt: Split-Attention Networks , 2020, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[2]  Jiaya Jia,et al.  Video Instance Segmentation with a Propose-Reduce Paradigm , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  Chen Liang,et al.  ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation , 2021, ArXiv.

[4]  Jianfeng Gao,et al.  DeBERTa: Decoding-enhanced BERT with Disentangled Attention , 2020, ICLR.

[5]  Qi Tian,et al.  Polar Relative Positional Encoding for Video-Language Segmentation , 2020, IJCAI.

[6]  Hao Wang,et al.  Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries , 2020, AAAI.

[7]  Yunchao Wei,et al.  Collaborative Video Object Segmentation by Foreground-Background Integration , 2020, ECCV.

[8]  Hao Chen,et al.  Conditional Convolutions for Instance Segmentation , 2020, ECCV.

[9]  Omer Levy,et al.  BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , 2019, ACL.

[10]  Bohyung Han,et al.  URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark , 2020, ECCV.

[11]  Cheng Deng,et al.  Asymmetric Cross-Guided Attention Network for Actor and Action Video Segmentation From Natural Language Query , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Dong Liu,et al.  Deep High-Resolution Representation Learning for Human Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Kai Chen,et al.  Hybrid Task Cascade for Instance Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[15]  Ning Xu,et al.  YouTube-VOS: Sequence-to-Sequence Video Object Segmentation , 2018, ECCV.

[16]  Bernt Schiele,et al.  Video Object Segmentation with Language Referring Expressions , 2018, ACCV.

[17]  Cees Snoek,et al.  Actor and Action Video Segmentation from a Sentence , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[19]  George Kurian,et al.  Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.

[20]  Licheng Yu,et al.  Modeling Context in Referring Expressions , 2016, ECCV.

[21]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[24]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[25]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[26]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.