STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization

In this article, we tackle the cross-modal video moment localization issue, namely, localizing the most relevant video moment in an untrimmed video given a sentence as the query. The majority of existing methods focus on generating video moment candidates with the help of multi-scale sliding window segmentation. They hence inevitably suffer from numerous candidates, which result in the less effective retrieval process. In addition, the spatial scene tracking is crucial for realizing the video moment localization process, but it is rarely considered in traditional techniques. To this end, we innovatively contribute a spatial-temporal reinforcement learning framework. Specifically, we first exploit a temporal-level reinforcement learning to dynamically adjust the boundary of localized video moment instead of the traditional window segmentation strategy, which is able to accelerate the localization process. Thereafter, a spatial-level reinforcement learning is proposed to track the scene on consecutive image frames, therefore filtering out less relevant information. Lastly, an alternative optimization strategy is proposed to jointly optimize the temporal- and spatial-level reinforcement learning. Thereinto, the two tasks of temporal boundary localization and spatial scene tracking are mutually reinforced. By experimenting on two real-world datasets, we demonstrate the effectiveness and rationality of our proposed solution.

[1]  Shaogang Gong,et al.  Deep Reinforcement Active Learning for Human-in-the-Loop Person Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[2]  Yixin Cao,et al.  Reinforced Negative Sampling over Knowledge Graph for Recommendation , 2020, WWW.

[3]  Larry S. Davis,et al.  MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Hongxia Jin,et al.  Vision-Language Recommendation via Attribute Augmented Multimodal Reinforcement Learning , 2019, ACM Multimedia.

[5]  Guangyi Xiao,et al.  Social-Enhanced Attentive Group Recommendation , 2019, IEEE Transactions on Knowledge and Data Engineering.

[6]  Depeng Jin,et al.  Reinforced Negative Sampling for Recommendation with Exposure Data , 2019, IJCAI.

[7]  Qi Tian,et al.  Cross-modal Moment Localization in Videos , 2018, ACM Multimedia.

[8]  Ling Shao,et al.  Unsupervised Deep Video Hashing via Balanced Code for Large-Scale Video Retrieval , 2019, IEEE Transactions on Image Processing.

[9]  Li Fei-Fei,et al.  End-to-End Learning of Action Detection from Frame Glimpses in Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Yuval Tassa,et al.  Continuous control with deep reinforcement learning , 2015, ICLR.

[11]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Jin Young Choi,et al.  Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Bernt Schiele,et al.  Grounding Action Descriptions in Videos , 2013, TACL.

[14]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[15]  Yang Yang,et al.  Video-based Person Re-identification via Self-Paced Learning and Deep Reinforcement Learning Framework , 2018, ACM Multimedia.

[16]  Meng Liu,et al.  Attentive Moment Retrieval in Videos , 2018, SIGIR.

[17]  Zhou Zhao,et al.  Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos , 2019, SIGIR.

[18]  Lin Ma,et al.  Temporally Grounding Natural Sentence in Video , 2018, EMNLP.

[19]  Sanja Fidler,et al.  Skip-Thought Vectors , 2015, NIPS.

[20]  Svetlana Lazebnik,et al.  Active Object Localization with Deep Reinforcement Learning , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[21]  Jiebo Luo,et al.  Localizing Natural Language in Videos , 2019, AAAI.

[22]  Qi Tian,et al.  Video-Based Cross-Modal Recipe Retrieval , 2019, ACM Multimedia.

[23]  Liang Wang,et al.  Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Kaiming He,et al.  Long-Term Feature Banks for Detailed Video Understanding , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Zi Huang,et al.  Curiosity-driven Reinforcement Learning for Diverse Visual Paragraph Generation , 2019, ACM Multimedia.

[26]  Trevor Darrell,et al.  Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Tong Xu,et al.  Latency Aware Adaptive Video Streaming using Ensemble Deep Reinforcement Learning , 2019, ACM Multimedia.

[28]  Bin Jiang,et al.  Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention , 2019, ICMR.

[29]  Kate Saenko,et al.  Multilevel Language and Vision Integration for Text-to-Clip Retrieval , 2018, AAAI.

[30]  Amit K. Roy-Chowdhury,et al.  Weakly Supervised Video Moment Retrieval From Text Queries , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Marco Winckler,et al.  User-Adaptive Editing for 360 degree Video Streaming with Deep Reinforcement Learning , 2019, ACM Multimedia.

[32]  Xiao Liu,et al.  Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos , 2019, AAAI.

[33]  Chao Yang,et al.  Attentive Group Recommendation , 2018, SIGIR.

[34]  Bernt Schiele,et al.  Script Data for Attribute-Based Recognition of Composite Activities , 2012, ECCV.

[35]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[36]  Trevor Darrell,et al.  Localizing Moments in Video with Temporal Language , 2018, EMNLP.

[37]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[38]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Ankit Shah,et al.  Natural Language Person Search Using Deep Reinforcement Learning , 2018, ArXiv.

[41]  Ramakant Nevatia,et al.  TALL: Temporal Activity Localization via Language Query , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42]  Zhe Gan,et al.  Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation , 2018, AAAI.

[43]  Jiebo Luo,et al.  Exploiting Temporal Relationships in Video Moment Localization with Natural Language , 2019, ACM Multimedia.