Learning Representations from Spatio-Temporal Distance Maps for 3D Action Recognition with Convolutional Neural Networks

This paper addresses the problem of action recognition from skeleton data. We propose a novel method that employs five Distance Maps (DMs), collectively termed Spatio-Temporal Distance Maps (ST-DMs), to capture spatio-temporal information from skeleton data for 3D action recognition. Of the five DMs, four capture the pose dynamics within a frame in the spatial domain, and one captures the variations between consecutive frames along the action sequence in the temporal domain. All DMs are encoded into texture images, and a Convolutional Neural Network (CNN) is employed to learn informative features from these texture images for action classification. In addition, a statistics-based normalization method is introduced to handle the varying heights of subjects. The efficacy of the proposed method is evaluated on two datasets, UTD-MHAD and NTU RGB+D, on which it achieves recognition accuracies of 91.63% and 80.36%, respectively.
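
To make the idea of a distance map concrete, the following is a minimal sketch of how one spatial DM and one temporal DM could be computed from a skeleton sequence and rescaled into an image. It is an illustration under stated assumptions, not the method of this paper: the abstract does not define the four spatial DMs, the texture encoding, or the height normalization, so the sketch assumes pairwise Euclidean joint distances for the spatial map, per-joint displacement between consecutive frames for the temporal map, and a plain min-max rescaling to 8-bit grayscale. The function names and the (T, J, 3) array layout are hypothetical.

```python
import numpy as np


def spatial_distance_map(skeleton):
    """Pairwise joint distances within each frame (assumed spatial DM).

    skeleton: array of shape (T, J, 3) with T frames and J 3D joints.
    Returns an array of shape (J * (J - 1) // 2, T): one row per joint
    pair, one column per frame.
    """
    num_joints = skeleton.shape[1]
    # Broadcast to all joint pairs: (T, J, 1, 3) - (T, 1, J, 3) -> (T, J, J, 3)
    diff = skeleton[:, :, None, :] - skeleton[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)  # (T, J, J)
    # Keep each unordered pair once (upper triangle, no diagonal).
    rows, cols = np.triu_indices(num_joints, k=1)
    return dist[:, rows, cols].T


def temporal_distance_map(skeleton):
    """Distance each joint moves between consecutive frames (assumed temporal DM).

    Returns an array of shape (J, T - 1).
    """
    step = np.diff(skeleton, axis=0)  # (T - 1, J, 3)
    return np.linalg.norm(step, axis=-1).T


def to_texture_image(dm):
    """Min-max rescale a distance map to an 8-bit grayscale image (assumed encoding)."""
    lo, hi = dm.min(), dm.max()
    if hi == lo:
        # A constant map carries no contrast; return an all-zero image.
        return np.zeros(dm.shape, dtype=np.uint8)
    scaled = (dm - lo) / (hi - lo)
    return np.round(scaled * 255).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sequence = rng.normal(size=(40, 20, 3))  # synthetic data: 40 frames, 20 joints
    spatial_img = to_texture_image(spatial_distance_map(sequence))
    temporal_img = to_texture_image(temporal_distance_map(sequence))
    print(spatial_img.shape, temporal_img.shape)
```

In a full pipeline, images such as these would typically be resized to the fixed input size of the CNN before training; that step, like the choice of colour encoding, is left open here because the abstract does not specify it.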