Exploring Motion and Appearance Information for Temporal Sentence Grounding

This paper addresses temporal sentence grounding. Previous works typically solve this task by learning frame-level video features and aligning them with the textual information. A major limitation of these works is that they fail to distinguish ambiguous video frames with subtle appearance differences, because the features are extracted at the frame level. Recently, a few methods have adopted Faster R-CNN to extract detailed object features in each frame in order to differentiate fine-grained appearance similarities. However, the object-level features extracted by Faster R-CNN lack motion analysis, since the object detection model has no temporal modeling. To address this issue, we propose a novel Motion-Appearance Reasoning Network (MARN), which incorporates both motion-aware and appearance-aware object features to better reason about object relations and model the activity among successive frames. Specifically, we first introduce two individual video encoders that embed the video into motion-aware and appearance-aware object representations. Then, we develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations, respectively. Finally, the motion and appearance information from the two branches is associated to generate more representative features for final grounding. Extensive experiments on two challenging datasets (Charades-STA and TACoS) show that the proposed MARN outperforms previous state-of-the-art methods by a large margin.
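
To make the two-branch design concrete, the following is a minimal sketch, assuming PyTorch, of how motion-oriented and appearance-oriented object features could be reasoned about separately and then fused. The class names (RelationBranch, MotionAppearanceReasoning), the single attention-style relation layer, and the mean pooling over objects are illustrative assumptions for exposition, not the paper's exact implementation.

```python
# Simplified sketch of a dual-branch motion-appearance relation module.
# All module and tensor names are illustrative assumptions.
import torch
import torch.nn as nn


class RelationBranch(nn.Module):
    """One branch: models pairwise object relations within each frame with a
    single attention-style graph layer, then pools objects per frame."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, frames, objects, dim)
        q, k, v = self.query(obj_feats), self.key(obj_feats), self.value(obj_feats)
        # Relation scores between objects belonging to the same frame.
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        related = attn @ v           # relation-aware object features
        return related.mean(dim=2)   # pool objects -> (batch, frames, dim)


class MotionAppearanceReasoning(nn.Module):
    """Associates motion-guided and appearance-guided frame features."""

    def __init__(self, dim: int):
        super().__init__()
        self.motion_branch = RelationBranch(dim)
        self.appearance_branch = RelationBranch(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, motion_objs: torch.Tensor, appear_objs: torch.Tensor) -> torch.Tensor:
        m = self.motion_branch(motion_objs)          # (batch, frames, dim)
        a = self.appearance_branch(appear_objs)      # (batch, frames, dim)
        return self.fuse(torch.cat([m, a], dim=-1))  # joint features for grounding


if __name__ == "__main__":
    batch, frames, objects, dim = 2, 16, 5, 128
    model = MotionAppearanceReasoning(dim)
    motion_objs = torch.randn(batch, frames, objects, dim)   # e.g. from a motion-oriented encoder
    appear_objs = torch.randn(batch, frames, objects, dim)   # e.g. from an appearance-oriented encoder
    print(model(motion_objs, appear_objs).shape)             # torch.Size([2, 16, 128])
```

The fused per-frame features would then be matched against the query sentence representation by whatever grounding head is used downstream.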

[1] Tao Mei, et al. Structured Two-Stream Attention Network for Video Question Answering, 2019, AAAI.

[2] Bernt Schiele, et al. Grounding Action Descriptions in Videos, 2013, TACL.

[3] Ahjeong Seo, et al. Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering, 2021, ACL.

[4] Ramakant Nevatia, et al. TALL: Temporal Activity Localization via Language Query, 2017, ICCV.

[5] Yale Song, et al. Video co-summarization: Video summarization by visual co-occurrence, 2015, CVPR.

[6] Yu Cheng, et al. Fine-grained Iterative Attention Network for Temporal Language Localization in Videos, 2020, ACM Multimedia.

[7] Jian Shao, et al. Boundary Proposal Network for Two-Stage Natural Language Video Localization, 2021, AAAI.

[8] Xiao-Ming Wu, et al. Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning, 2018, AAAI.

[9] Lorenzo Torresani, et al. Learning Spatiotemporal Features with 3D Convolutional Networks, 2015, ICCV.

[10] Zhou Zhao, et al. Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval, 2021, CVPR.

[11] Jingwen Wang, et al. Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction, 2020, AAAI.

[12] Yoshua Bengio, et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, 2014, ArXiv.

[13] Kate Saenko, et al. Multilevel Language and Vision Integration for Text-to-Clip Retrieval, 2018, AAAI.

[14] Yitian Yuan, et al. Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos, 2019, IEEE TPAMI.

[15] Trevor Darrell, et al. Localizing Moments in Video with Natural Language, 2017, ICCV.

[16] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, 2015, IEEE TPAMI.

[17] Zhijie Lin, et al. Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding, 2020, IJCAI.

[18] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.

[19] Jiasen Lu, et al. Hierarchical Question-Image Co-Attention for Visual Question Answering, 2016, NIPS.

[20] Ross B. Girshick, et al. Mask R-CNN, 2017, ICCV.

[21] Jianfeng Dong, et al. Context-aware Biaffine Localizing Network for Temporal Sentence Grounding, 2021, CVPR.

[22] Xiaoye Qu, et al. Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos, 2021, EMNLP.

[23] Tao Mei, et al. To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression, 2018, AAAI.

[24] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.

[25] Pan Zhou, et al. Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network, 2020, COLING.

[26] Xiao-Yang Liu, et al. Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization, 2020, ACM Multimedia.

[27] Hao Zhang, et al. Span-based Localizing Network for Natural Language Video Localization, 2020, ACL.

[28] Lin Ma, et al. Temporally Grounding Natural Sentence in Video, 2018, EMNLP.

[29] Trevor Darrell, et al. Natural Language Object Retrieval, 2016, CVPR.

[30] Zhou Zhao, et al. Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos, 2019, SIGIR.

[31] Yang Zhao, et al. Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences, 2020, CVPR.

[32] Yale Song, et al. TVSum: Summarizing web videos using titles, 2015, CVPR.

[33] Xiaoye Qu, et al. Memory-Guided Semantic Learning Network for Temporal Sentence Grounding, 2022, AAAI.

[34] Xiaoye Qu, et al. Unsupervised Temporal Video Grounding with Deep Semantic Clustering, 2022, AAAI.

[35] Max Welling, et al. Semi-Supervised Classification with Graph Convolutional Networks, 2016, ICLR.

[36] Michael S. Bernstein, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations, 2016, IJCV.

[37] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[38] Bohyung Han, et al. Local-Global Video-Text Interactions for Temporal Grounding, 2020, CVPR.

[39] Jiebo Luo, et al. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language, 2019, AAAI.

[40] Christoph Feichtenhofer, et al. X3D: Expanding Architectures for Efficient Video Recognition, 2020, CVPR.

[41] Runhao Zeng, et al. Dense Regression Network for Video Grounding, 2020, CVPR.

[42] Hongdong Li, et al. Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention, 2020, WACV.

[43] Truyen Tran, et al. Hierarchical Conditional Relation Networks for Video Question Answering, 2020, CVPR.

[44] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[45] Rui Qiao, et al. Interventional Video Grounding with Dual Contrastive Learning, 2021, CVPR.

[46] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, CVPR.

[47] Xiaoye Qu, et al. Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding, 2021, EMNLP.

[48] Ali Farhadi, et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, 2016, ECCV.

[49] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[50] Long Chen, et al. Rethinking the Bottom-Up Framework for Query-Based Video Localization, 2020, AAAI.