Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding

Weakly supervised temporal sentence grounding offers better scalability and practicality than fully supervised methods in real-world application scenarios. However, most existing methods fail to model fine-grained video-text local correspondences and lack effective supervision for correspondence learning, yielding unsatisfactory performance. To address these issues, we propose an end-to-end Local Correspondence Network (LCNet) for weakly supervised temporal sentence grounding. The proposed LCNet enjoys several merits. First, we represent video and text features in a hierarchical manner to model fine-grained video-text correspondences. Second, we design a self-supervised cycle-consistent loss as a learning guidance for video-text matching. To the best of our knowledge, this is the first work to fully explore the fine-grained correspondences between video and text for temporal sentence grounding via self-supervised learning. Extensive experiments on two benchmark datasets demonstrate that the proposed LCNet significantly outperforms existing weakly supervised methods.
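The self-supervised cycle-consistent loss mentioned above can be illustrated with a minimal sketch. This is a hypothetical NumPy implementation, not the paper's exact formulation: each word attends to video clips, the attended clip feature attends back to the words, and the loss penalizes round trips that do not return to the original word.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cycle_consistency_loss(words, clips):
    """Hypothetical cycle-consistent matching loss.

    words: (N, d) word features; clips: (T, d) video clip features.
    Forward pass attends from each word to the clips; backward pass
    attends from the attended clip feature back to the words. A
    consistent round trip should map word i back to word i, so the
    target attention matrix is the identity.
    """
    attn_word_to_clip = softmax(words @ clips.T)        # (N, T)
    attended_clips = attn_word_to_clip @ clips          # (N, d)
    attn_clip_to_word = softmax(attended_clips @ words.T)  # (N, N)
    target = np.eye(len(words))
    # Cross-entropy against the identity mapping over words.
    return -np.mean(np.sum(target * np.log(attn_clip_to_word + 1e-9), axis=1))

rng = np.random.default_rng(0)
words = rng.standard_normal((4, 8))
clips = rng.standard_normal((6, 8))
loss = cycle_consistency_loss(words, clips)
```

Minimizing such a loss requires no temporal annotations: the supervision signal comes entirely from the requirement that the text-to-video-to-text round trip be self-consistent, which is why it suits the weakly supervised setting.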
