Efficient Spatio-Temporal Video Grounding with Semantic-Guided Feature Decomposition

Spatio-temporal video grounding (STVG) aims to localize the spatio-temporal tube of an object in a video according to a given text query. Current approaches address STVG with end-to-end frameworks but suffer from heavy computational cost and insufficient spatio-temporal interaction. To overcome these limitations, we propose a novel Semantic-Guided Feature Decomposition based Network (SGFDN). A semantic-guided mapping operation decomposes the 3D spatio-temporal feature into a 2D motion representation and a 1D object embedding while preserving most object-related semantic information. Replacing the full spatio-temporal feature with these decomposed features substantially reduces the cost of computationally expensive operations such as attention. Building on this decomposition strategy, a pyramid relevance filtering based attention captures cross-modal interactions at multiple spatio-temporal scales, and a decomposition-based grounding head localizes the queried object with reduced computation. Extensive experiments on two widely used STVG benchmarks (VidSTG and HC-STVG) demonstrate that our method achieves state-of-the-art performance with lower computational cost. The code is available at https://github.com/TJUMMG/SGFDN.
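
To make the complexity argument concrete, below is a minimal PyTorch sketch of the decomposition idea. It is not the paper's actual implementation: the function name `semantic_guided_decompose`, the pooled text query, and all shapes are illustrative assumptions. It shows how a text-conditioned pooling can collapse a T×H×W token grid into a T-length motion sequence plus a single object embedding, so that downstream attention scales with T rather than T·H·W.

```python
import torch
import torch.nn.functional as F

def semantic_guided_decompose(video_feat, text_query):
    """Decompose a 3D spatio-temporal feature into a per-frame
    "motion" sequence (2D) and a single object embedding (1D).

    video_feat: (B, T, H, W, C) spatio-temporal video feature
    text_query: (B, C) pooled text embedding
    """
    B, T, H, W, C = video_feat.shape
    flat = video_feat.reshape(B, T, H * W, C)

    # Semantic-guided mapping: score every spatial location against
    # the text query, then use the scores as pooling weights so that
    # query-relevant regions dominate each frame's summary.
    scores = torch.einsum("btnc,bc->btn", flat, text_query) / C ** 0.5
    weights = F.softmax(scores, dim=-1)                    # (B, T, H*W)

    motion = torch.einsum("btn,btnc->btc", weights, flat)  # (B, T, C)
    obj = motion.mean(dim=1)                               # (B, C)
    return motion, obj

# Toy usage: attention over the decomposed features now scales with
# T = 16 tokens instead of T * H * W = 16 * 14 * 14 = 3136 tokens.
video = torch.randn(2, 16, 14, 14, 256)
text = torch.randn(2, 256)
motion, obj = semantic_guided_decompose(video, text)
print(motion.shape, obj.shape)  # (2, 16, 256) and (2, 256)
```

Under these assumptions, self- or cross-attention over the motion sequence costs O(T²·C) instead of O((T·H·W)²·C), which is the source of the efficiency gain the abstract claims.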
