A two-stream heterogeneous network for action recognition based on skeleton and RGB modalities

In recent years, skeleton-based action recognition with graph convolutional networks (GCNs) has achieved great success. However, because skeleton data contain only the coordinates of human body joints, other key information about actions is missing, such as subtle hand motions and the objects a person is interacting with, which leads to unsatisfactory performance. RGB data can help recognize the actions on which skeleton-based methods fall short. In this work, we propose a novel two-stream heterogeneous network consisting of a GCN and a CNN for action recognition. Specifically, the GCN takes the skeletal sequence as input to exploit skeleton information. For the RGB video, a CNN model, ResNet (2+1)D, is adapted to exploit RGB information. The discriminant canonical correlation analysis (DCCA) method is then used to integrate the output feature maps of the skeleton and RGB streams, resulting in improved performance. Experimental results on the large-scale NTU RGB+D dataset show that the proposed model outperforms state-of-the-art models.
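
To make the fusion step concrete, the following is a minimal sketch of correlation-based feature fusion for two streams. It is not the paper's implementation: it substitutes ordinary (unsupervised) canonical correlation analysis for the discriminant variant, which would additionally use class labels, and it uses synthetic random features in place of real GCN and CNN outputs. The function name `cca_fuse`, the regularization constant `reg`, and all feature dimensions are assumptions chosen for illustration.

```python
import numpy as np


def cca_fuse(x, y, k, reg=1e-3):
    """Project two feature sets onto k canonical directions and concatenate.

    x: (n, dx) features standing in for the skeleton stream.
    y: (n, dy) features standing in for the RGB stream.
    Plain CCA is used here as a stand-in; the discriminant variant is not shown.
    """
    # Center each feature set.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]

    # Regularized within-set covariances and the cross-covariance.
    sxx = x.T @ x / (n - 1) + reg * np.eye(x.shape[1])
    syy = y.T @ y / (n - 1) + reg * np.eye(y.shape[1])
    sxy = x.T @ y / (n - 1)

    def inv_sqrt(m):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, v = np.linalg.eigh(m)
        return v @ np.diag(w ** -0.5) @ v.T

    sxx_is, syy_is = inv_sqrt(sxx), inv_sqrt(syy)

    # Singular values of the whitened cross-covariance are the
    # (regularized) canonical correlations, in descending order.
    u, s, vt = np.linalg.svd(sxx_is @ sxy @ syy_is)
    wx = sxx_is @ u[:, :k]      # projection for the first stream
    wy = syy_is @ vt.T[:, :k]   # projection for the second stream

    # Fuse by concatenating the two projected feature sets.
    return np.hstack([x @ wx, y @ wy]), s[:k]


# Synthetic example: both streams observe the same 4-dimensional latent signal.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 4))
skel_feat = shared @ rng.normal(size=(4, 16)) + 0.1 * rng.normal(size=(200, 16))
rgb_feat = shared @ rng.normal(size=(4, 32)) + 0.1 * rng.normal(size=(200, 32))

fused, corr = cca_fuse(skel_feat, rgb_feat, k=4)
print(fused.shape)  # one fused vector of length 2k per sample
```

In a full pipeline, the fused vectors would feed a classifier. Concatenation is only one option; summing the two projections is a common alternative that halves the fused dimension.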