Collaborative Spatial-Temporal Interaction for Language-Based Moment Retrieval

Language-based moment retrieval aims to localize the video moment that is most semantically relevant to a given language query. The task requires not only a mutual understanding of query semantics and video content, but also accurate extraction of location cues from both the temporal and spatial dimensions. Unfortunately, existing methods fail to model the fine-grained relationship between intrinsic spatial-temporal information and the language query. In this work, we introduce a collaborative spatial-temporal interaction (CSTI) model to explore the complex alignment patterns between visual and linguistic features. First, we present a video-enhanced query attention block that improves language understanding by using an attention mechanism to summarize the frame features into a compact video abstract for every query word. Second, we develop a cross-modal semantic modulation block that decomposes the video-enhanced query feature into spatial-relevant and temporal-relevant linguistic parts, which guide the mining of context-aware visual location evidence in each respective dimension. Finally, we apply a visual gate to every frame to balance the distinct influences of the spatial-relevant and temporal-relevant query features. Experimental evaluations on two popular benchmark datasets show that our model outperforms state-of-the-art methods by a clear margin.
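
To make the video-enhanced query attention concrete, the following is a minimal PyTorch sketch, not the authors' implementation: it assumes standard scaled dot-product attention in which each query word attends over frame features to form its compact video abstract, and a sigmoid fusion gate for combining the abstract with the word feature. All module names, projections, and tensor shapes here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoEnhancedQueryAttention(nn.Module):
    """Sketch of a word-level attention over frames (hypothetical module)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)    # projects word features to queries
        self.k_proj = nn.Linear(dim, dim)    # projects frame features to keys
        self.v_proj = nn.Linear(dim, dim)    # projects frame features to values
        self.gate = nn.Linear(2 * dim, dim)  # fuses a word with its video abstract

    def forward(self, words: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
        # words:  (B, L, D) word-level query features
        # frames: (B, T, D) frame-level video features
        q = self.q_proj(words)                               # (B, L, D)
        k = self.k_proj(frames)                              # (B, T, D)
        v = self.v_proj(frames)                              # (B, T, D)
        scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5   # (B, L, T)
        attn = F.softmax(scores, dim=-1)                     # each word attends over all frames
        abstract = attn @ v                                  # (B, L, D) compact video abstract per word
        # assumed fusion: gated residual combination of word and abstract
        gate = torch.sigmoid(self.gate(torch.cat([words, abstract], dim=-1)))
        return words + gate * abstract                       # video-enhanced query feature


if __name__ == "__main__":
    block = VideoEnhancedQueryAttention(dim=256)
    words = torch.randn(2, 12, 256)    # a 12-word query
    frames = torch.randn(2, 64, 256)   # 64 sampled frames
    print(block(words, frames).shape)  # torch.Size([2, 12, 256])
```

Under these assumptions, the output keeps the word-level granularity of the query while injecting video context into each word, which is the role the abstract describes for the video-enhanced query feature before it is decomposed into spatial-relevant and temporal-relevant parts.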