Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos

This paper tackles the problem of temporal language localization in videos, which aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video. The task is non-trivial: it requires not only a comprehensive understanding of both the video and the sentence query, but also accurate capture of the semantic correspondence between them. Existing efforts mainly explore the sequential relations among video clips and query words to reason over the video and sentence query, neglecting other intra-modal relations (e.g., semantic similarity among video clips and syntactic dependency among query words). Toward this end, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly exploits the complex intra-modal relations and inter-modal interactions residing in the video and sentence query to facilitate their understanding and the capture of semantic correspondence between them. In addition, we devise an adaptive context-aware localization method, in which context information is incorporated into the candidate moments and multi-scale fully connected layers are designed to rank the generated coarse candidate moments of different lengths and adjust their boundaries. Extensive experiments on the Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.
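To make the graph-based reasoning concrete, the following is a minimal sketch of the standard graph convolution propagation rule that underlies models of this kind, applied to a toy graph of video-clip nodes. All names, dimensions, and the example adjacency structure here are illustrative assumptions, not the paper's actual implementation; MIGCN builds richer intra- and inter-modal graphs on top of this basic operation.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph convolution layer (Kipf & Welling propagation rule):
    H = ReLU(D^{-1/2} (A + I) D^{-1/2} X W).
    A: (n, n) adjacency matrix, X: (n, d) node features,
    W: (d, d') learnable projection."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ X @ W, 0.0)    # ReLU activation

# Toy example: 4 "video clip" nodes. Sequential edges connect
# temporal neighbors; one extra edge (0-3) stands in for a
# semantic-similarity relation between non-adjacent clips.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # 8-dim clip features
W = rng.standard_normal((8, 16))   # projection to 16 dims
H = gcn_layer(A, X, W)
print(H.shape)  # (4, 16): each clip now aggregates neighbor context
```

Stacking such layers lets each clip node aggregate information from increasingly distant neighbors, which is what allows intra-modal relations beyond simple sequential adjacency to inform the final moment representations.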
