Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos

We address the problem of temporal sentence localization in videos (TSLV). Traditional methods follow a top-down framework that localizes the target segment with predefined segment proposals. Although they achieve decent performance, the proposals are handcrafted and redundant. Recently, the bottom-up framework has attracted increasing attention due to its superior efficiency: it directly predicts, for each frame, the probability of being a segment boundary. However, the bottom-up model performs worse than its top-down counterpart because it fails to exploit segment-level interaction. In this paper, we propose an Adaptive Proposal Generation Network (APGN) that preserves segment-level interaction while improving efficiency. Specifically, we first perform foreground-background classification over the video frames and regress boundaries on the foreground frames to adaptively generate proposals. In this way, the handcrafted proposal design is discarded and the number of redundant proposals is reduced. A proposal consolidation module is then developed to enhance the semantics of the generated proposals. Finally, we localize the target moments with these generated proposals following the top-down framework. Extensive experiments on three challenging benchmarks show that our proposed APGN significantly outperforms previous state-of-the-art methods.
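To make the adaptive proposal generation step concrete, the following is a minimal sketch, assuming a PyTorch implementation: each frame receives a foreground score and a pair of regressed boundary offsets, and only foreground frames emit proposals. The module name, layer shapes, and the `fg_threshold` parameter are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaptiveProposalGenerator(nn.Module):
    """Hypothetical sketch: classify each frame as foreground/background,
    then regress segment boundaries only from the foreground frames."""

    def __init__(self, dim=512):
        super().__init__()
        self.fg_classifier = nn.Linear(dim, 1)       # per-frame foreground score
        self.boundary_regressor = nn.Linear(dim, 2)  # (left, right) boundary offsets

    def forward(self, frame_feats, fg_threshold=0.5):
        # frame_feats: (T, dim) query-aware frame features
        T = frame_feats.size(0)
        fg_prob = torch.sigmoid(self.fg_classifier(frame_feats)).squeeze(-1)  # (T,)
        offsets = self.boundary_regressor(frame_feats)                        # (T, 2)

        # Keep only foreground frames; each one yields a single adaptive proposal.
        keep = fg_prob > fg_threshold
        idx = torch.arange(T, dtype=torch.float32)[keep]
        starts = (idx - offsets[keep, 0]).clamp(0, T - 1)
        ends = (idx + offsets[keep, 1]).clamp(0, T - 1)
        proposals = torch.stack([starts, ends], dim=-1)  # (N, 2), N <= T
        return proposals, fg_prob
```

Under this sketch, the number of proposals adapts to the video content (at most one per foreground frame) instead of being fixed by a handcrafted sliding-window scheme; the resulting proposals could then be consolidated and scored following the top-down pipeline described above.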
