Local Correspondence Network for Weakly Supervised Temporal Sentence Grounding

Weakly supervised temporal sentence grounding offers better scalability and practicality than fully supervised methods in real-world application scenarios. However, most existing methods fail to model fine-grained video-text local correspondences and lack effective supervision for correspondence learning, yielding unsatisfactory performance. To address these issues, we propose an end-to-end Local Correspondence Network (LCNet) for weakly supervised temporal sentence grounding. The proposed LCNet enjoys several merits. First, we represent video and text features in a hierarchical manner to model fine-grained video-text correspondences. Second, we design a self-supervised cycle-consistent loss as a learning guidance for video-text matching. To the best of our knowledge, this is the first work to fully explore the fine-grained correspondences between video and text for temporal sentence grounding via self-supervised learning. Extensive experiments on two benchmark datasets demonstrate that the proposed LCNet significantly outperforms existing weakly supervised methods.
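The self-supervised cycle-consistent loss mentioned above can be illustrated with a minimal sketch. This is a hypothetical NumPy implementation, not the paper's exact formulation: each word attends to video clips, the attended clip feature attends back to the words, and the loss penalizes round trips that do not return to the original word.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cycle_consistency_loss(words, clips):
    """Hypothetical cycle-consistent matching loss.

    words: (N, d) word features; clips: (T, d) video clip features.
    Forward pass attends from each word to the clips; backward pass
    attends from the attended clip feature back to the words. A
    consistent round trip should map word i back to word i, so the
    target attention matrix is the identity.
    """
    attn_word_to_clip = softmax(words @ clips.T)        # (N, T)
    attended_clips = attn_word_to_clip @ clips          # (N, d)
    attn_clip_to_word = softmax(attended_clips @ words.T)  # (N, N)
    target = np.eye(len(words))
    # Cross-entropy against the identity mapping over words.
    return -np.mean(np.sum(target * np.log(attn_clip_to_word + 1e-9), axis=1))

rng = np.random.default_rng(0)
words = rng.standard_normal((4, 8))
clips = rng.standard_normal((6, 8))
loss = cycle_consistency_loss(words, clips)
```

Minimizing such a loss requires no temporal annotations: the supervision signal comes entirely from the requirement that the text-to-video-to-text round trip be self-consistent, which is why it suits the weakly supervised setting.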
