Temporal Cross-Layer Correlation Mining for Action Recognition

Neighboring frames are more strongly correlated than frames that are temporally distant. In this paper, we explore the temporal correlations among neighboring frames and exploit cross-layer multi-scale features for action recognition. We present a Temporal Cross-Layer Correlation (TCLC) framework for temporal correlation learning. First, we introduce a context-aware reconstruction block to enable the exploration of neighboring context. This block reconstructs the past frame and the future frame in a unified way, learning to mine frame correlations and aggregate long sequences at the same time. We demonstrate that this neighborhood mining process enhances the discriminative ability of the network. Second, we propose a novel cross-layer attention module and a center-guided attention mechanism to integrate features with contextual knowledge from multiple scales. Cross-layer feature learning proceeds in two stages. The first stage uses the cross-layer attention module to determine the importance weight of each convolutional layer. The second stage uses the center-guided attention mechanism to aggregate local features from each layer into a final video representation, leveraging global centers to extract semantic knowledge shared among videos. We evaluate TCLC on three action recognition datasets: UCF-101, HMDB-51 and Kinetics. Our experimental results demonstrate the superiority of the proposed temporal correlation mining method.
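
To make the two-stage cross-layer feature learning concrete, the following is a minimal NumPy sketch and not the actual TCLC implementation. It assumes that local features from every layer have already been projected to a common dimension D, that layer importance is scored with a single hypothetical vector `w` applied to mean-pooled layer descriptors, and that the global centers attend over local features with scaled dot-product attention. All function names, shapes, and the random inputs are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_weights(layer_feats, w):
    # Stage 1 (assumed form): one importance weight per layer.
    # Each layer is summarized by the mean of its local features,
    # scored with the hypothetical vector w, then normalized over layers.
    descriptors = np.stack([f.mean(axis=0) for f in layer_feats])  # (L, D)
    return softmax(descriptors @ w)                                # (L,)

def center_guided_pool(feats, centers):
    # Stage 2 (assumed form): each global center attends over the
    # local features of one layer and pools them into one vector.
    d = feats.shape[1]
    attn = softmax(centers @ feats.T / np.sqrt(d), axis=1)  # (K, N)
    return attn @ feats                                     # (K, D)

def video_representation(layer_feats, w, centers):
    # Fuse the per-layer pooled features using the layer weights.
    alphas = cross_layer_weights(layer_feats, w)
    pooled = [center_guided_pool(f, centers) for f in layer_feats]
    fused = sum(a * p for a, p in zip(alphas, pooled))      # (K, D)
    return fused.reshape(-1), alphas

# Illustrative inputs: three layers with different numbers of local features.
rng = np.random.default_rng(0)
D, K = 8, 4
layer_feats = [rng.standard_normal((n, D)) for n in (49, 16, 4)]
w = rng.standard_normal(D)
centers = rng.standard_normal((K, D))

rep, alphas = video_representation(layer_feats, w, centers)
print(rep.shape, alphas.shape)
```

In this sketch the centers are shared across all layers and all videos, which is one way to read the role of the global centers; in a trained model both `w` and `centers` would be learned parameters rather than random draws.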