Paper Information - Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Query-Dependent Video Representation for Moment Retrieval and Highlight Detection

Video moment retrieval and highlight detection (MR/HD) have recently drawn attention as the demand for video understanding has increased drastically. The key objective of MR/HD is to localize the moment and to estimate the clip-wise accordance level, i.e., the saliency score, with respect to a given text query. Although recent transformer-based models have brought some advances, we found that these methods do not fully exploit the information in a given query. For example, the relevance between the text query and the video contents is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. Having observed the insignificant role of the given query in transformer architectures, we design our encoding module to start with cross-attention layers that explicitly inject the context of the text query into the video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn encourages the model to estimate the precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor that adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building the query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on the QVHighlights, TVSum, and Charades-STA datasets. Code is available at github.com/wjun0830/QD-DETR.
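The two query-centric ideas in the abstract, cross-attention from video clips to text tokens and negative video-query pairs trained toward low saliency, can be illustrated with a minimal pure-Python sketch. This is not the QD-DETR implementation: the function names (`cross_attend`, `make_negative_pairs`, `negative_saliency_loss`) are hypothetical, the attention has a single head with no learned projections, negatives are formed by a simple cyclic shift of queries within a batch (assuming other samples' queries are irrelevant to each video), and the `-log(1 - s)` penalty is one plausible choice of loss rather than a confirmed detail of the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(clips, tokens):
    """Single-head cross-attention without learned projections (a simplification).

    Each video clip feature acts as the attention query; the text tokens act as
    keys and values, so every output row is a weighted mix of text tokens.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for clip in clips:
        # Scaled dot-product score between this clip and every text token.
        scores = [sum(c * t for c, t in zip(clip, tok)) / scale for tok in tokens]
        weights = softmax(scores)
        fused.append([sum(w * tok[d] for w, tok in zip(weights, tokens))
                      for d in range(dim)])
    return fused


def make_negative_pairs(videos, queries):
    """Pair each video with the next sample's query (cyclic shift in the batch).

    Assumes a batch of at least two samples with mutually irrelevant queries.
    """
    shifted = queries[1:] + queries[:1]
    return list(zip(videos, shifted))


def negative_saliency_loss(neg_scores, eps=1e-7):
    """Mean of -log(1 - s): near zero when saliency scores s in [0, 1] are low."""
    return sum(-math.log(max(1.0 - s, eps)) for s in neg_scores) / len(neg_scores)


# Toy usage: two clips and three text tokens, all 2-dimensional.
clips = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attend(clips, tokens)
pairs = make_negative_pairs(["v0", "v1", "v2"], ["q0", "q1", "q2"])
```

In this sketch the loss grows as the saliency predicted for an irrelevant pair approaches 1, which is the behavior the abstract describes for negative pairs.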

Jae-Pil Heo | WonJun Moon | Dongchan Park | Sangeek Hyun | S. Park
