Object Tracking via Spatial-Temporal Memory Network

Temporal and spatial contexts, which characterize target appearance variations and target-background differences, respectively, are crucial for improving the online adaptability and instance-level discriminative ability of object trackers. However, most existing trackers focus on either the temporal or the spatial context and do not exploit the two simultaneously and effectively. In this paper, we propose a Spatial-TEmporal Memory (STEM) network that exploits both contexts jointly for object tracking. Specifically, we develop a key-value structured memory model, equipped with a key-value index-based memory reading mechanism, to model the spatial and temporal contexts simultaneously. To update the memory with new target states while preserving its diversity, we introduce a similarity-aware memory update scheme. In addition, we construct an entropy-guided ensemble strategy that fuses the prediction models built on the two contexts, so that both contribute jointly to estimating the target state. Extensive experiments on eight challenging datasets (OTB2015, TC128, UAV123, VOT2018, LaSOT, TrackingNet, GOT-10k, and OxUvA) demonstrate that the proposed method performs favorably against state-of-the-art trackers.
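
To make the three components named above more concrete, the following is a minimal NumPy sketch of a key-value memory read, a similarity-aware update, and an entropy-guided fusion of two response maps. It is an illustration under stated assumptions, not the STEM implementation: the abstract does not give the exact formulations, so the function names, the dot-product soft read, the "overwrite the most similar slot" rule, and the inverse-entropy weights are all hypothetical choices.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def read_memory(keys, values, query):
    """Soft key-value read (assumed form): weight each stored value by the
    softmax of the dot product between its key and the query."""
    weights = softmax(keys @ query)
    return weights @ values


def update_memory(keys, values, new_key, new_value, capacity):
    """Similarity-aware update (assumed rule): append while there is room;
    once full, overwrite the slot whose key has the highest cosine
    similarity to the new key, so near-duplicates are replaced first."""
    if len(keys) < capacity:
        return np.vstack([keys, new_key]), np.vstack([values, new_value])
    norms = np.linalg.norm(keys, axis=1) * np.linalg.norm(new_key) + 1e-12
    slot = int(np.argmax(keys @ new_key / norms))
    keys, values = keys.copy(), values.copy()
    keys[slot], values[slot] = new_key, new_value
    return keys, values


def entropy(p):
    """Shannon entropy of a probability vector."""
    return float(-(p * np.log(p + 1e-12)).sum())


def entropy_guided_fusion(resp_a, resp_b, eps=1e-6):
    """Fuse two response maps with inverse-entropy weights (assumed form):
    a more peaked, lower-entropy map is treated as more confident."""
    w_a = 1.0 / (entropy(softmax(resp_a.ravel())) + eps)
    w_b = 1.0 / (entropy(softmax(resp_b.ravel())) + eps)
    return (w_a * resp_a + w_b * resp_b) / (w_a + w_b)
```

In this sketch, a sharply peaked response map dominates the fused result, while a nearly flat one contributes little; a learned or differently normalized weighting would be an equally plausible reading of "entropy-guided".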