Event-aware Video Corpus Moment Retrieval

Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a large corpus of untrimmed videos using a natural language query. Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames and ranking videos by maximum frame similarity. However, this approach overlooks the semantic structure embedded in the information between frames, namely the event, a crucial element for human comprehension of videos. Motivated by this, we propose EventFormer, a model that explicitly uses events within videos as the fundamental units for video retrieval. The model extracts event representations through event reasoning and hierarchical event encoding: the event reasoning module groups consecutive and visually similar frame representations into events, while hierarchical event encoding captures information at both the frame and event levels. We also introduce anchor multi-head self-attention to encourage the Transformer to capture the relevance of adjacent content in the video. EventFormer is trained with two-branch contrastive learning and dual optimization for the two sub-tasks of VCMR. Extensive experiments on the TVR, ANetCaps, and DiDeMo benchmarks demonstrate the effectiveness and efficiency of EventFormer in VCMR, achieving new state-of-the-art results. The effectiveness of EventFormer is also validated on the partially relevant video retrieval task.
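The event reasoning step described above, grouping consecutive and visually similar frame representations into events, can be sketched as a simple greedy merge. This is a minimal illustration of the idea, not the paper's implementation; the similarity threshold and the running-mean event centroid are assumptions made for the sketch.

```python
import numpy as np

def group_frames_into_events(frames, sim_threshold=0.8):
    """Greedily merge consecutive, visually similar frames into events.

    frames: (T, D) array of frame representations.
    Returns a list of (start, end) frame-index pairs, one per event
    (end inclusive). Hypothetical sketch: the threshold value and the
    mean-pooled centroid are assumptions, not the paper's design.
    """
    # Normalize so that a dot product equals cosine similarity.
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    events = [[0, 0]]         # current event spans frames [start, end]
    centroid = f[0].copy()    # running mean of the current event's frames
    for t in range(1, len(f)):
        c = centroid / np.linalg.norm(centroid)
        if float(f[t] @ c) >= sim_threshold:
            events[-1][1] = t                    # extend the current event
            n = events[-1][1] - events[-1][0] + 1
            centroid += (f[t] - centroid) / n    # update the running mean
        else:
            events.append([t, t])                # start a new event
            centroid = f[t].copy()
    return [tuple(e) for e in events]
```

An event-level representation could then be obtained by pooling the frames inside each span, which is what makes events, rather than individual frames, the retrieval units.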
