Memory-Guided Semantic Learning Network for Temporal Sentence Grounding

Daizong Liu1,2, Xiaoye Qu2, Xing Di3, Yu Cheng4, Zichuan Xu5, Pan Zhou1*
1 The Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology
2 School of Electronic Information and Communication, Huazhong University of Science and Technology
3 ProtagoLabs Inc
4 Microsoft Research
5 Dalian University of Technology
{dzliu, xiaoye, panzhou}@hust.edu.cn, xing.di@protagolabs.com, yu.cheng@microsoft.com, z.xu@dlut.edu.cn

[1] Bernt Schiele, et al. Grounding Action Descriptions in Videos, 2013, TACL.

[2] Yale Song, et al. Video co-summarization: Video summarization by visual co-occurrence, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Qi Tian, et al. Cross-modal Moment Localization in Videos, 2018, ACM Multimedia.

[4] Yu Cheng, et al. Fine-grained Iterative Attention Network for Temporal Language Localization in Videos, 2020, ACM Multimedia.

[5] Jian Shao, et al. Boundary Proposal Network for Two-Stage Natural Language Video Localization, 2021, AAAI.

[6] Wei Liu, et al. Recurrent Fusion Network for Image Captioning, 2018, ECCV.

[7] Hao Zhang, et al. Span-based Localizing Network for Natural Language Video Localization, 2020, ACL.

[8] Jingwen Wang, et al. Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction, 2020, AAAI.

[9] Pan Zhou, et al. Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network, 2020, COLING.

[10] Xiao-Yang Liu, et al. Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization, 2020, ACM Multimedia.

[11] Yoshua Bengio, et al. An Empirical Study of Example Forgetting during Deep Neural Network Learning, 2018, ICLR.

[12] Yale Song, et al. TVSum: Summarizing web videos using titles, 2015, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Liang Wang, et al. Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model, 2019, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Juan Carlos Niebles, et al. Dense-Captioning Events in Videos, 2017, IEEE International Conference on Computer Vision (ICCV).

[15] Ali Farhadi, et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, 2016, ECCV.

[16] Runhao Zeng, et al. Dense Regression Network for Video Grounding, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Lin Ma, et al. Temporally Grounding Natural Sentence in Video, 2018, EMNLP.

[18] Alex Graves, et al. Neural Turing Machines, 2014, ArXiv.

[19] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.

[20] Xiaoye Qu, et al. Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos, 2021, EMNLP.

[21] Tao Mei, et al. To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression, 2018, AAAI.

[22] Zhou Zhao, et al. Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos, 2019, SIGIR.

[23] Meng Liu, et al. Attentive Moment Retrieval in Videos, 2018, SIGIR.

[24] Hongdong Li, et al. Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention, 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[25] Ramakant Nevatia, et al. MAC: Mining Activity Concepts for Language-Based Temporal Localization, 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).

[26] Trevor Darrell, et al. Localizing Moments in Video with Natural Language, 2017, IEEE International Conference on Computer Vision (ICCV).

[27] Ramakant Nevatia, et al. TALL: Temporal Activity Localization via Language Query, 2017, IEEE International Conference on Computer Vision (ICCV).

[28] Yu-Gang Jiang, et al. Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos, 2020, ECCV.

[29] Limin Wang, et al. Temporal Action Detection with Structured Segment Networks, 2017, IEEE International Conference on Computer Vision (ICCV).

[30] Shih-Fu Chang, et al. Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Chunhua Shen, et al. Visual Question Answering with Memory-Augmented Networks, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32] Lorenzo Torresani, et al. Learning Spatiotemporal Features with 3D Convolutional Networks, 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[33] Mike Zheng Shou, et al. On Pursuit of Designing Multi-modal Transformer for Video Grounding, 2021, EMNLP.

[34] Kate Saenko, et al. Multilevel Language and Vision Integration for Text-to-Clip Retrieval, 2018, AAAI.

[35] Larry S. Davis, et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Max Welling, et al. Semi-Supervised Classification with Graph Convolutional Networks, 2016, ICLR.

[37] Xiaoye Qu, et al. Exploring Motion and Appearance Information for Temporal Sentence Grounding, 2022, AAAI.

[38] Xiaoye Qu, et al. Unsupervised Temporal Video Grounding with Deep Semantic Clustering, 2022, AAAI.

[39] Yitian Yuan, et al. Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40] Richard Socher, et al. Dynamic Memory Networks for Visual and Textual Question Answering, 2016, ICML.

[41] Long Chen, et al. Rethinking the Bottom-Up Framework for Query-Based Video Localization, 2020, AAAI.

[42] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[43] Rui Qiao, et al. Interventional Video Grounding with Dual Contrastive Learning, 2021, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Kuldip K. Paliwal, et al. Bidirectional recurrent neural networks, 1997, IEEE Trans. Signal Process.

[45] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Xiaoye Qu, et al. Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding, 2021, EMNLP.

[47] Bohyung Han, et al. Local-Global Video-Text Interactions for Temporal Grounding, 2020, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Jiebo Luo, et al. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language, 2019, AAAI.

[49] Yan Huang, et al. ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching, 2019, IEEE/CVF International Conference on Computer Vision (ICCV).

[50] Jianfeng Dong, et al. Context-aware Biaffine Localizing Network for Temporal Sentence Grounding, 2021, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).