3D sign language recognition with joint distance and angular coded color topographical descriptor on a 2-stream CNN

Abstract 3D sign language recognition is currently one of the most challenging and interesting problems in human action recognition (HAR). A sign in a 3D video can be characterized by the 3D locations of the body joints over time. The objective of this study is therefore to construct color-coded topographical descriptors from the joint distances and joint angles computed from these joint locations. We call the two resulting color-coded images the joint distance topographical descriptor (JDTD) and the joint angle topographical descriptor (JATD), respectively. For classification, we propose a two-stream convolutional neural network (2CNN) architecture that takes the color-coded JDTD and JATD images as input. The two independent streams are merged by concatenating their features in the dense layer. For a given query 3D sign (or action), the network produces a list of class scores, from which the text label corresponding to the sign is obtained. The results show improved classifier performance over previous methods, due to the combination of distance and angular features, which helps to discriminate between closely related spatio-temporal features. To benchmark the proposed model, we compared our results with state-of-the-art baseline action recognition frameworks on our own 3D sign language dataset and on two publicly available 3D mocap action datasets, HDM05 and CMU.
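As a rough illustration of the descriptor idea, the following Python sketch turns a sequence of 3D joint locations into two color-coded images, one from pairwise joint distances and one from pairwise joint angles. It is only a sketch under stated assumptions, not the method used in this study: the abstract does not specify which joint pairs are used, how the angles are defined, or which color map is applied. Here all joint pairs are used, the angle is taken between the position vectors of the two joints, and the blue-to-red ramp in `color_code` is a hypothetical choice. The function names are likewise illustrative.

```python
import numpy as np


def joint_distances(seq):
    """Pairwise Euclidean joint distances.

    seq: array of shape (T, J, 3) with T frames and J joints.
    Returns an array of shape (P, T), P = J * (J - 1) / 2 joint pairs.
    """
    i, j = np.triu_indices(seq.shape[1], k=1)  # all unordered joint pairs (assumption)
    d = np.linalg.norm(seq[:, i, :] - seq[:, j, :], axis=-1)  # (T, P)
    return d.T


def joint_angles(seq, eps=1e-8):
    """Angle (radians) between the position vectors of each joint pair.

    This angle definition is an assumption made for illustration.
    Returns an array of shape (P, T).
    """
    i, j = np.triu_indices(seq.shape[1], k=1)
    a, b = seq[:, i, :], seq[:, j, :]
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    cos = dot / (norms + eps)  # eps avoids division by zero at the origin
    return np.arccos(np.clip(cos, -1.0, 1.0)).T


def color_code(m):
    """Min-max normalize a 2D map and encode it as an RGB uint8 image.

    The blue-to-red ramp below is a hypothetical color map.
    """
    lo, hi = m.min(), m.max()
    x = (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m)
    rgb = np.stack([x, 1.0 - np.abs(2.0 * x - 1.0), 1.0 - x], axis=-1)
    return (255 * rgb).round().astype(np.uint8)


# Toy input: 5 frames, 4 joints, random 3D coordinates.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 4, 3))
jdtd = color_code(joint_distances(seq))  # distance image, shape (6, 5, 3)
jatd = color_code(joint_angles(seq))     # angle image, shape (6, 5, 3)
```

In this sketch each row of an image corresponds to one joint pair and each column to one frame, so the two images share the same layout and could each be passed to one stream of a two-stream network.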

