The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions
Aixin Sun | Wei Jing | Hao Zhang | Joey Tianyi Zhou
[1] Jihua Zhu,et al. Multi-Level Query Interaction for Temporal Language Grounding , 2022, IEEE Transactions on Intelligent Transportation Systems.
[2] Yuechen Wang,et al. Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding , 2022, EMNLP.
[3] Ruixuan Li,et al. SNEAK: Synonymous Sentences-Aware Adversarial Attack on Natural Language Video Localization , 2021, ArXiv.
[4] Yu-Gang Jiang,et al. BEVT: BERT Pretraining of Video Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Fabian Caba Heilbron,et al. MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Aixin Sun,et al. Towards Debiasing Temporal Sentence Grounding in Video , 2021, ArXiv.
[7] Zixi Jia,et al. STCM-Net: A symmetrical one-stage network for temporal language localization in videos , 2021, Neurocomputing.
[8] Liqiang Nie,et al. Hierarchical Deep Residual Reasoning for Temporal Moment Localization , 2021, MMAsia.
[9] Luxi Yang,et al. Collaborative Spatial-Temporal Interaction for Language-Based Moment Retrieval , 2021, 2021 13th International Conference on Wireless Communications and Signal Processing (WCSP).
[10] Kate Saenko,et al. Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos , 2021, NeurIPS.
[11] Wei Zhang,et al. Learning Segment Similarity and Alignment in Large-Scale Content Based Video Retrieval , 2021, ACM Multimedia.
[12] Yinghui Xu,et al. AsyNCE: Disentangling False-Positives for Weakly-Supervised Video Grounding , 2021, ACM Multimedia.
[13] Yu-Gang Jiang,et al. Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval , 2021, ACM Multimedia.
[14] Shaoxiang Chen,et al. Towards Bridging Video and Language by Caption Generation and Sentence Localization , 2021, ACM Multimedia.
[15] Bernard Ghanem,et al. Relation-aware Video Reading Comprehension for Temporal Language Grounding , 2021, EMNLP.
[16] Dong Xu,et al. STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[17] Changsheng Xu,et al. Fast Video Moment Retrieval , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[18] Dmytro Okhonko,et al. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding , 2021, EMNLP.
[19] Liqiang Nie,et al. Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos , 2021, IEEE Transactions on Image Processing.
[20] Yu-Gang Jiang,et al. Self-Supervised Learning for Semi-Supervised Temporal Language Grounding , 2021, IEEE Transactions on Multimedia.
[21] Jun Xiao,et al. Natural Language Video Localization with Learnable Moment Proposals , 2021, EMNLP.
[22] Chong-Wah Ngo,et al. CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval , 2021, ACM Multimedia.
[23] Wenwu Zhu,et al. A Survey on Temporal Sentence Grounding in Videos , 2021, ACM Trans. Multim. Comput. Commun. Appl..
[24] Xiaoye Qu,et al. Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos , 2021, EMNLP.
[25] Xiaoye Qu,et al. Progressively Guide to Attend: An Iterative Alignment Framework for Temporal Sentence Grounding , 2021, EMNLP.
[26] Mike Zheng Shou,et al. On Pursuit of Designing Multi-modal Transformer for Video Grounding , 2021, EMNLP.
[27] Tianhao Li,et al. Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding , 2021, AAAI.
[28] Jian Yang,et al. Discriminative Cross-Modality Attention Network for Temporal Inconsistent Audio-Visual Event Localization , 2021, IEEE Transactions on Image Processing.
[29] Dongyeop Kang,et al. Zero-shot Natural Language Video Localization , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[30] Shiwei Zhang,et al. Support-Set Based Cross-Supervision for Video Grounding , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[31] Shiyu Ji,et al. Local-enhanced Interaction for Temporal Moment Localization , 2021, ICMR.
[32] Tamara L. Berg,et al. mTVR: Multilingual Moment Retrieval in Videos , 2021, ACL.
[33] Shaogang Gong,et al. Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[34] Yilong Yin,et al. Single-shot Semantic Matching Network for Moment Localization in Videos , 2021, ACM Trans. Multim. Comput. Commun. Appl..
[35] Tamara L. Berg,et al. QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries , 2021, ArXiv.
[36] Ming-Hsuan Yang,et al. End-to-end Multi-modal Video Temporal Grounding , 2021, NeurIPS.
[37] Mohsen Malmir,et al. Cross Interaction Network for Natural Language Guided Video Moment Retrieval , 2021, SIGIR.
[38] Junyu Gao,et al. Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval , 2021, 2021 IEEE International Conference on Multimedia and Expo (ICME).
[39] Wengang Zhou,et al. Weakly Supervised Temporal Adjacent Network for Language Grounding , 2021, IEEE Transactions on Multimedia.
[40] Tatsuya Harada,et al. Video Moment Retrieval with Text Query Considering Many-to-Many Correspondence Using Potentially Relevant Pair , 2021, ArXiv.
[41] Liqiang Nie,et al. Coarse-to-Fine Semantic Alignment for Cross-Modal Moment Localization , 2021, IEEE Transactions on Image Processing.
[42] Hanli Wang,et al. MABAN: Multi-Agent Boundary-Aware Network for Natural Language Moment Retrieval , 2021, IEEE Transactions on Image Processing.
[43] Meng Wang,et al. Deconfounded Video Moment Retrieval with Causal Intervention , 2021, SIGIR.
[44] Zhou Zhao,et al. Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Yu-Gang Jiang,et al. Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[46] Rui Qiao,et al. Interventional Video Grounding with Dual Contrastive Learning , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Zhou Zhao,et al. Cascaded Prediction Network via Segment Tree for Temporal Video Grounding , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Zhengjun Zha,et al. Structured Multi-Level Interaction Network for Video Moment Localization via Language Query , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Heng Tao Shen,et al. Multi-stage Aggregated Transformer Network for Temporal Language Localization in Videos , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Dan Guo,et al. Proposal-Free Video Grounding with Contextual Pyramid Network , 2021, AAAI.
[51] Yadong Mu,et al. Dense Events Grounding in Video , 2021, AAAI.
[52] Li Niu,et al. Activity Image-to-Video Retrieval by Disentangling Appearance and Motion , 2021, AAAI.
[53] Joey Tianyi Zhou,et al. Parallel Attention Network with Sequence Matching for Video Grounding , 2021, Findings of ACL.
[54] Liangli Zhen,et al. Video Corpus Moment Retrieval with Contrastive Learning , 2021, SIGIR.
[55] Junyu Gao,et al. Learning Video Moment Retrieval Without a Single Annotated Video , 2021, IEEE Transactions on Circuits and Systems for Video Technology.
[56] Liqiang Nie,et al. Video Moment Localization via Deep Cross-Modal Hashing , 2021, IEEE Transactions on Image Processing.
[57] Wen Wang,et al. DCT-net: A deep co-interactive transformer network for video temporal grounding , 2021, Image Vis. Comput..
[58] Yilong Yin,et al. A Survey on Natural Language Video Localization , 2021, ArXiv.
[59] Jianfeng Dong,et al. Context-aware Biaffine Localizing Network for Temporal Sentence Grounding , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[60] Yi Yang,et al. Decoupled Spatial Temporal Graphs for Generic Visual Grounding , 2021, ArXiv.
[61] Wei Ji,et al. Boundary Proposal Network for Two-Stage Natural Language Video Localization , 2021, AAAI.
[62] Liangli Zhen,et al. Natural Language Video Localization: A Revisit in Span-Based Question Answering Framework , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[63] Yongdong Zhang,et al. Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding , 2021, IEEE Transactions on Image Processing.
[64] Zhe Gan,et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[65] Jianfeng Dong,et al. Progressive Localization Networks for Language-Based Moment Localization , 2021, ACM Trans. Multim. Comput. Commun. Appl..
[66] Wenwu Zhu,et al. A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric , 2021, HUMA @ ACM Multimedia.
[67] Qi Tian,et al. Interaction-Integrated Network for Natural Language Moment Localization , 2021, IEEE Transactions on Image Processing.
[68] Jiebo Luo,et al. Multi-Scale 2D Temporal Adjacency Networks for Moment Localization With Natural Language , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[69] Pan Zhou,et al. Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network , 2020, COLING.
[70] Tao Xiang,et al. Boundary-sensitive Pre-training for Temporal Localization in Videos , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[71] Bernard Ghanem,et al. VLG-Net: Video-Language Graph Matching Network for Video Grounding , 2020, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW).
[72] Ming Zhao,et al. A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus , 2020, ArXiv.
[73] Xiaojie Jin,et al. Human-Centric Spatio-Temporal Video Grounding With Visual Transformers , 2020, IEEE Transactions on Circuits and Systems for Video Technology.
[74] S. Gelly,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.
[75] Basura Fernando,et al. DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video , 2020, ArXiv.
[76] Richang Hong,et al. Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization , 2020, ACM Multimedia.
[77] Zhiwei Xiong,et al. Dual Path Interaction Network for Video Moment Localization , 2020, ACM Multimedia.
[78] Zheng Qin,et al. STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization , 2020, ACM Multimedia.
[79] Runhao Zeng,et al. Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization , 2020, ACM Multimedia.
[80] Yu Kong,et al. Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos , 2020, ACM Multimedia.
[81] Florian Metze,et al. Support-set bottlenecks for video-text representation learning , 2020, ICLR.
[82] Zhaohui Li,et al. A Survey of Temporal Activity Localization via Language in Untrimmed Videos , 2020, 2020 International Conference on Culture-oriented Science & Technology (ICCST).
[83] Dejing Xu,et al. A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention , 2020, ArXiv.
[84] Jihua Zhu,et al. Frame-Wise Cross-Modal Matching for Video Moment Retrieval , 2020, IEEE Transactions on Multimedia.
[85] Jie Wu,et al. Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos , 2020, ACM Multimedia.
[86] Esa Rahtu,et al. Uncovering Hidden Challenges in Query-Based Video Moment Retrieval , 2020, BMVC.
[87] Fei Wu,et al. An Attentive Sequence to Sequence Translator for Localizing Video Clips by Natural Language , 2020, IEEE Transactions on Multimedia.
[88] C. Yoo,et al. VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval , 2020, ECCV.
[89] Amit K. Roy-Chowdhury,et al. Text-Based Localization of Moments in a Video Corpus , 2020, IEEE Transactions on Image Processing.
[90] Jieming Zhu,et al. Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos , 2020, ACM Multimedia.
[91] Yan Yan,et al. Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
[92] Yu Cheng,et al. Fine-grained Iterative Attention Network for Temporal Language Localization in Videos , 2020, ACM Multimedia.
[93] Pan Zhou,et al. Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization , 2020, ACM Multimedia.
[94] Yu-Gang Jiang,et al. Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos , 2020, ECCV.
[95] Ye Wang,et al. Deep Graph Random Process for Relational-Thinking-Based Speech Recognition , 2020, ICML.
[96] Qing Li,et al. Aligned Dual Channel Graph Convolutional Network for Visual Question Answering , 2020, ACL.
[97] Kai Shen,et al. Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description , 2020, IJCAI.
[98] Zhijie Lin,et al. Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding , 2020, IJCAI.
[99] Yi Yang,et al. ActBERT: Learning Global-Local Video-Text Representations , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[100] Nicolas Usunier,et al. End-to-End Object Detection with Transformers , 2020, ECCV.
[101] Juntao Yu,et al. Named Entity Recognition as Dependency Parsing , 2020, ACL.
[102] Licheng Yu,et al. Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training , 2020, EMNLP.
[103] Bohyung Han,et al. Local-Global Video-Text Interactions for Temporal Grounding , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[104] Runhao Zeng,et al. Dense Regression Network for Video Grounding , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[105] Long Chen,et al. Rethinking the Bottom-Up Framework for Query-Based Video Localization , 2020, AAAI.
[106] Shyh-Kang Jeng,et al. Weakly-Supervised Video Re-Localization with Multiscale Attention Model , 2020, AAAI.
[107] Yan Yan,et al. Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization , 2020, AAAI.
[108] Hao Zhang,et al. Span-based Localizing Network for Natural Language Video Localization , 2020, ACL.
[109] Kan Chen,et al. Video Object Grounding Using Semantic Roles in Language Description , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[110] Dan Jurafsky,et al. Racial disparities in automated speech recognition , 2020, Proceedings of the National Academy of Sciences.
[111] Zhou Yu,et al. Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos , 2020, ArXiv.
[112] Wenhan Luo,et al. Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video , 2020, ArXiv.
[113] Mohit Bansal,et al. TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval , 2020, ECCV.
[114] Zhou Zhao,et al. Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[115] Guanbin Li,et al. Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video , 2020, AAAI.
[116] Zijian Zhang,et al. Moment Retrieval via Cross-Modal Interaction Networks With Query Reconstruction , 2020, IEEE Transactions on Image Processing.
[117] Jiebo Luo,et al. Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language , 2019, AAAI.
[118] Ali K. Thabet,et al. G-TAD: Sub-Graph Localization for Temporal Action Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[119] Li Niu,et al. A Proposal-based Approach for Activity Image-to-Video Retrieval , 2019, AAAI.
[120] Zhou Zhao,et al. Weakly-Supervised Video Moment Retrieval via Semantic Completion Network , 2019, AAAI.
[121] Long Chen,et al. DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization , 2019, EMNLP.
[122] Yitian Yuan,et al. Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[123] Yan Yan,et al. Dual Attention Matching for Audio-Visual Event Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[124] Bryan A. Plummer,et al. LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval , 2019, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
[125] Wenhao Jiang,et al. Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction , 2019, AAAI.
[126] Hongdong Li,et al. Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).
[127] Iryna Gurevych,et al. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks , 2019, EMNLP.
[128] Jiebo Luo,et al. Exploiting Temporal Relationships in Video Moment Localization with Natural Language , 2019, ACM Multimedia.
[129] Larry S. Davis,et al. WSLLN: Weakly Supervised Natural Language Localization Networks , 2019, EMNLP.
[130] Bernard Ghanem,et al. Temporal Localization of Moments in Video Collections with Natural Language , 2019, ArXiv.
[131] Omer Levy,et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach , 2019, ArXiv.
[132] Shilei Wen,et al. BMN: Boundary-Matching Network for Temporal Action Proposal Generation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[133] Jiebo Luo,et al. Localizing Natural Language in Videos , 2019, AAAI.
[134] Yu-Gang Jiang,et al. Semantic Proposal for Activity Localization in Videos via Sentence Query , 2019, AAAI.
[135] Rick Siow Mong Goh,et al. Dual Adversarial Neural Transfer for Low-Resource Named Entity Recognition , 2019, ACL.
[136] Deng Cai,et al. Localizing Unseen Activities in Video via Image Query , 2019, IJCAI.
[137] Lin Ma,et al. Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video , 2019, ACL.
[138] Zhou Zhao,et al. Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos , 2019, SIGIR.
[139] Bin Jiang,et al. Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention , 2019, ICMR.
[140] Liang Wang,et al. Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[141] Boqing Gong,et al. Not All Frames Are Equal: Weakly-Supervised Video Grounding With Contextual Similarity and Visual Clustering Losses , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[142] Yang Feng,et al. Spatio-Temporal Video Re-Localization by Warp LSTM , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[143] James M. Rehg,et al. Tripping through time: Efficient Localization of Activities in Videos , 2019, BMVC.
[144] Jimmy J. Lin,et al. Simple BERT Models for Relation Extraction and Semantic Role Labeling , 2019, ArXiv.
[145] Amit K. Roy-Chowdhury,et al. Weakly Supervised Video Moment Retrieval From Text Queries , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[146] Alexander G. Hauptmann,et al. ExCL: Extractive Clip Localization Using Natural Language Descriptions , 2019, NAACL.
[147] Xiao Liu,et al. Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos , 2019, AAAI.
[148] Chuang Gan,et al. Weakly Supervised Dense Event Captioning in Videos , 2018, NeurIPS.
[149] Larry S. Davis,et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[150] Ramakant Nevatia,et al. MAC: Mining Activity Concepts for Language-Based Temporal Localization , 2018, 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).
[151] Qi Tian,et al. Cross-modal Moment Localization in Videos , 2018, ACM Multimedia.
[152] Yu Qiao,et al. Find and Focus: Retrieve and Localize Video Events with Natural Language Queries , 2018, ECCV.
[153] Juan Carlos Niebles,et al. Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos , 2018, ECCV.
[154] Joon Son Chung,et al. Deep Audio-Visual Speech Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[155] George Vogiatzis,et al. Asymmetric Spatio-Temporal Embeddings for Large-Scale Image-to-Video Retrieval , 2018, BMVC.
[156] Trevor Darrell,et al. Localizing Moments in Video with Temporal Language , 2018, EMNLP.
[157] Yang Feng,et al. Video Re-localization , 2018, ECCV.
[158] Yahong Han,et al. Multi-modal Circulant Fusion for Video-to-Language and Backward , 2018, IJCAI.
[159] Meng Liu,et al. Attentive Moment Retrieval in Videos , 2018, SIGIR.
[160] Juan Carlos Niebles,et al. Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[161] Tao Mei,et al. To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression , 2018, AAAI.
[162] Kate Saenko,et al. Multilevel Language and Vision Integration for Text-to-Clip Retrieval , 2018, AAAI.
[163] Kate Saenko,et al. Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning , 2018, ArXiv.
[164] Chenliang Xu,et al. Audio-Visual Event Localization in Unconstrained Videos , 2018, ECCV.
[165] Yelong Shen,et al. FusionNet: Fusing via Fully-Aware Attention with Application to Machine Comprehension , 2017, ICLR.
[166] Christopher Clark,et al. Simple and Effective Multi-Paragraph Reading Comprehension , 2017, ACL.
[167] Kaiqi Huang,et al. A2-RL: Aesthetics Aware Reinforcement Learning for Image Cropping , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[168] Kaiming He,et al. Focal Loss for Dense Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[169] Trevor Darrell,et al. Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[170] Ming Zhou,et al. Gated Self-Matching Networks for Reading Comprehension and Question Answering , 2017, ACL.
[171] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[172] Ramakant Nevatia,et al. TALL: Temporal Activity Localization via Language Query , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[173] Holger Schwenk,et al. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data , 2017, EMNLP.
[174] Limin Wang,et al. Temporal Action Detection with Structured Segment Networks , 2017, International Journal of Computer Vision.
[175] Kate Saenko,et al. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[176] Ali Farhadi,et al. Bidirectional Attention Flow for Machine Comprehension , 2016, ICLR.
[177] Antonio Torralba,et al. SoundNet: Learning Sound Representations from Unlabeled Video , 2016, NIPS.
[178] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.
[179] J. Pearl,et al. Causal Inference in Statistics: A Primer , 2016.
[180] Eduard H. Hovy,et al. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF , 2016, ACL.
[181] Sergey Ioffe,et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.
[182] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[183] Trevor Darrell,et al. Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[184] Sanja Fidler,et al. Skip-Thought Vectors , 2015, NIPS.
[185] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[186] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[187] David A. Shamma,et al. YFCC100M , 2015, Commun. ACM.
[188] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).
[189] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.
[190] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.
[191] Mihai Surdeanu,et al. The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.
[192] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[193] Bernt Schiele,et al. Grounding Action Descriptions in Videos , 2013, TACL.
[194] J. Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.
[195] Bernt Schiele,et al. Script Data for Attribute-Based Recognition of Composite Activities , 2012, ECCV.
[196] Wenwu Wang,et al. Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention , 2023, IEEE Transactions on Multimedia.
[197] Yilong Yin,et al. Regularized Two Granularity Loss Function for Weakly Supervised Video Moment Retrieval , 2022, IEEE Transactions on Multimedia.
[198] Zhou Zhao,et al. Temporal Textual Localization in Video via Adversarial Bi-Directional Interaction Networks , 2021, IEEE Transactions on Multimedia.
[199] Yu-Gang Jiang,et al. Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language , 2020, ECCV.
[200] Zhou Zhao,et al. The Supplementary Material: Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding , 2020.
[201] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[202] Lin Ma,et al. Temporally Grounding Natural Sentence in Video , 2018, EMNLP.
[203] A. Shapiro. Monte Carlo Sampling Methods , 2003.
[204] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.