D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

Temporal sentence grounding (TSG) aims to locate a specific moment in an untrimmed video given a natural language query. Weakly supervised methods still exhibit a large performance gap relative to fully supervised ones, while the latter require laborious timestamp annotations. In this study, we aim to reduce the annotation cost while keeping performance competitive with fully supervised methods. To this end, we investigate the recently proposed glance-supervised TSG task, which requires only a single-frame annotation (referred to as a glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning (SA-GCL) module and a Dynamic Gaussian prior Adjustment (DGA) module. Specifically, SA-GCL samples reliable positive moments from a 2D temporal map by jointly leveraging a Gaussian prior and semantic consistency, which helps align positive sentence-moment pairs in the joint embedding space. Moreover, to alleviate the annotation bias introduced by glance annotation and to model complex queries consisting of multiple events, the DGA module dynamically adjusts the distribution to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of the proposed D3G: it outperforms state-of-the-art weakly supervised methods by a large margin and narrows the performance gap to fully supervised methods. Code is available at https://github.com/solicucu/D3G.
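The core idea of a Gaussian prior anchored at a glance annotation can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the unnormalized Gaussian weighting, and the use of moment centers on the 2D temporal map are all simplifying assumptions made here for clarity.

```python
import numpy as np

def gaussian_prior_2d(num_clips, glance_idx, sigma=1.5):
    """Illustrative sketch: weight each candidate moment (i, j) on a 2D
    temporal map by how close its center is to the glance-annotated clip.

    A moment spans clips [i, j] with i <= j; its center is (i + j) / 2.
    Weights follow an unnormalized Gaussian centered at the glance clip,
    so moments around the glance are favored when sampling positives.
    """
    prior = np.zeros((num_clips, num_clips))
    for i in range(num_clips):
        for j in range(i, num_clips):  # only valid moments (upper triangle)
            center = (i + j) / 2.0
            prior[i, j] = np.exp(-((center - glance_idx) ** 2) / (2 * sigma ** 2))
    return prior

# Example: 8 candidate clips, glance annotation at clip 3.
prior = gaussian_prior_2d(8, glance_idx=3)
```

Any moment whose center coincides with the glance clip (e.g. the span [2, 4]) receives the maximum weight of 1, while moments far from the glance are strongly down-weighted; sampling positives in proportion to these weights is one simple way to realize a glance-centered prior.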
