Paper Information - Coarse to Fine: Video Retrieval before Moment Localization

Current state-of-the-art methods for video corpus moment retrieval (VCMR) often use a similarity-based feature alignment approach for convenience and speed. However, late-fusion methods such as cosine similarity alignment cannot make full use of the information in both the query text and the videos. In this paper, we combine feature alignment with feature fusion to improve performance on VCMR.

Zijian Gao | Huanyu Liu | Jingyu Liu
