Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction

Moment retrieval aims to localize the most relevant moment in an untrimmed video according to a given natural language query. Existing works often focus on only one aspect of this emerging task, such as query representation learning, video context modeling, or multi-modal fusion, and thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) that considers multiple crucial factors for this challenging task, including the syntactic dependencies of natural language queries, long-range semantic dependencies in the video context, and sufficient cross-modal interaction. Specifically, we devise a syntactic GCN that leverages the syntactic structure of queries for fine-grained representation learning, and propose a multi-head self-attention mechanism to capture long-range semantic dependencies in the video context. Next, we employ multi-stage cross-modal interaction to explore the potential relations between video and query contents, and we further treat query reconstruction from the cross-modal representations of the target moment as an auxiliary task to strengthen those representations. Extensive experiments on the ActivityNet Captions and TACoS datasets demonstrate the effectiveness of our proposed method.
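To make the two encoding components more concrete, below is a minimal PyTorch sketch of a syntactic GCN layer over a query's dependency graph and a multi-head self-attention block over clip-level video features. The layer sizes, degree normalization, residual connections, and naming are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch only: dimensions, normalization, and residuals are
# assumptions, not the exact CMIN architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SyntacticGCNLayer(nn.Module):
    """One graph-convolution step over a query's dependency graph.

    `adj` is an (L, L) adjacency matrix built from the syntactic parse
    (with self-loops added), and `x` holds the word representations.
    """

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):           # x: (L, dim), adj: (L, L)
        # Normalize by node degree so high-degree words do not dominate.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        h = adj @ self.linear(x) / deg
        return F.relu(h) + x             # residual connection


class VideoSelfAttention(nn.Module):
    """Multi-head self-attention over clip-level video features to capture
    long-range temporal dependencies."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v):                # v: (batch, T, dim)
        out, _ = self.attn(v, v, v)
        return out + v                   # residual connection


if __name__ == "__main__":
    words = torch.randn(12, 256)                 # 12 query words
    adj = (torch.rand(12, 12) > 0.7).float()     # stand-in dependency graph
    adj = adj + torch.eye(12)                    # add self-loops
    clips = torch.randn(2, 128, 256)             # 2 videos, 128 clips each
    print(SyntacticGCNLayer(256)(words, adj).shape)   # torch.Size([12, 256])
    print(VideoSelfAttention(256)(clips).shape)       # torch.Size([2, 128, 256])
```

In practice, the GCN output would be concatenated or interacted with the attended video features in the cross-modal stage; that fusion and the query-reconstruction head are omitted here for brevity.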
