The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions

Temporal sentence grounding in videos (TSGV), also known as natural language video localization (NLVL) or video moment retrieval (VMR), aims to retrieve the temporal moment in an untrimmed video that semantically corresponds to a language query. Connecting computer vision and natural language, TSGV has drawn significant attention from researchers in both communities. This survey summarizes the fundamental concepts of TSGV and the current state of research, and outlines future research directions. As background, we present a common structure of functional components in TSGV in a tutorial style, from feature extraction from the raw video and language query to answer prediction of the target moment. We then review techniques for multimodal understanding and interaction, which is the key focus of TSGV for effective alignment between the two modalities. We construct a taxonomy of TSGV techniques and elaborate on the methods in each category, along with their strengths and weaknesses. Lastly, we discuss issues with current TSGV research and share our insights on promising research directions.
