ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

Text-based video segmentation is a challenging task that segments out the objects in a video referred to by a natural language expression. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representations into segmentation models in a bottom-up manner, which merely conducts vision-language interaction within the local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can barely construct region-level relationships from partial observations, which is contrary to the description logic of natural language referring expressions. In fact, people usually describe a target object through its relations to other objects, which may not be easily understood without seeing the whole video. To address this issue, we introduce a novel top-down approach that imitates how humans segment an object under language guidance: we first identify all candidate objects in the video and then select the referred one by parsing the relations among these high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding, i.e., positional relations, text-guided semantic relations, and temporal relations. Extensive experiments on the A2D Sentences and J-HMDB Sentences datasets show that our method outperforms state-of-the-art methods by a large margin. Furthermore, building on this insight, we won 1st place in the CVPR 2021 Referring Youtube-VOS challenge.
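The abstract outlines a top-down pipeline: obtain high-level candidate objects first, then reason over positional, text-guided semantic, and temporal relations to pick the referred one. The following is a minimal PyTorch-style sketch of that idea, not the authors' implementation; it assumes candidate embeddings, boxes, and masks come from some off-the-shelf instance segmenter and that word/sentence embeddings come from a language encoder, and all module and tensor names (`RelationScorer`, `pos_mlp`, `sem_attn`, `temp_attn`) are illustrative.

```python
# Hypothetical sketch of the top-down idea from the abstract:
# (1) per-frame candidate object embeddings and boxes are given,
# (2) each object embedding is enriched with positional, text-guided
#     semantic, and temporal relation cues,
# (3) every candidate is scored against the sentence embedding and the
#     mask of the best-scoring object is returned as the segmentation.
import torch
import torch.nn as nn


class RelationScorer(nn.Module):
    """Illustrative scorer of candidate objects against a referring expression."""

    def __init__(self, obj_dim: int = 256, text_dim: int = 256):
        super().__init__()
        # Positional relation: embed normalized box geometry of each candidate.
        self.pos_mlp = nn.Sequential(nn.Linear(4, obj_dim), nn.ReLU())
        # Text-guided semantic relation: cross-attention from objects to words.
        self.sem_attn = nn.MultiheadAttention(obj_dim, num_heads=4, batch_first=True)
        # Temporal relation: self-attention over the same object across frames.
        self.temp_attn = nn.MultiheadAttention(obj_dim, num_heads=4, batch_first=True)
        self.score_head = nn.Linear(obj_dim + text_dim, 1)

    def forward(self, obj_feats, obj_boxes, word_feats, sent_feat):
        # obj_feats: (T, N, D) per-frame candidate embeddings
        # obj_boxes: (T, N, 4) normalized boxes
        # word_feats: (L, D)   word embeddings, sent_feat: (D,) sentence embedding
        T, N, D = obj_feats.shape
        x = obj_feats + self.pos_mlp(obj_boxes)          # positional relation
        words = word_feats.unsqueeze(0).expand(T, -1, -1)
        x, _ = self.sem_attn(x, words, words)            # text-guided semantic relation
        x = x.transpose(0, 1)                            # (N, T, D): per-object track
        x, _ = self.temp_attn(x, x, x)                   # temporal relation
        x = x.mean(dim=1)                                # pool over time -> (N, D)
        sent = sent_feat.unsqueeze(0).expand(N, -1)
        return self.score_head(torch.cat([x, sent], dim=-1)).squeeze(-1)  # (N,)


# Usage: the highest-scoring candidate indexes into the candidate masks.
scorer = RelationScorer()
scores = scorer(torch.randn(8, 5, 256), torch.rand(8, 5, 4),
                torch.randn(10, 256), torch.randn(256))
referred_idx = scores.argmax().item()
```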
