Video Recognition of American Sign Language Using Two-Stream Convolution Neural Networks

Sign language conveys meaning through the manual-visual modality. Meaning is expressed through a flow of manual signs combined with non-manual elements, and sign gestures are interpreted as words, letters, and numbers. This study proposes a two-stream Convolutional Neural Network (CNN) to recognize and classify words from video of hand motion. The two-stream CNN runs two processes, a spatial stream and a temporal stream. The spatial stream detects edges and overall global features, while the temporal stream identifies local action features in stacked optical flow images of 10 frames. The output of each stream is passed through a Softmax function, and an average fusion function combines the two streams. Training the two streams separately reduced computing time and worked around resource limitations. Building a two-stream CNN model requires choosing a configuration, that is, an architecture paired with an optimizer that updates the weights during training. The configurations evaluated were VGG with SGD, ResNet with Adam, ResNet with SGD, Xceptionnet with Adam, and Xceptionnet with SGD. The best precision was obtained with Xceptionnet and SGD for the spatial stream and Xceptionnet and Adam for the temporal stream. This architecture achieved a precision of 89.4% for the single top choice (Top1) and 99.4% for the top five choices (Top5).
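The late-fusion step described in the abstract can be illustrated with a minimal sketch. It assumes that each stream produces one vector of class logits per video and that average fusion is the element-wise mean of the two Softmax outputs. The function names, the six-class setup, and the logit values are hypothetical and do not come from the study.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def average_fusion(spatial_logits, temporal_logits):
    """Average the per-class softmax scores of the two streams (assumed fusion rule)."""
    return (softmax(spatial_logits) + softmax(temporal_logits)) / 2.0

def top_k_hit(scores, true_label, k):
    """Return True if true_label is among the k highest-scoring classes."""
    top_k = np.argsort(scores)[::-1][:k]
    return bool(true_label in top_k)

# Hypothetical logits for one video over 6 word classes (made-up numbers).
spatial_logits = np.array([2.0, 1.0, 0.1, 0.0, -1.0, 0.5])
temporal_logits = np.array([0.5, 2.5, 0.2, 0.0, -0.5, 0.1])

fused = average_fusion(spatial_logits, temporal_logits)
prediction = int(np.argmax(fused))  # class with the highest fused score
```

In this toy input the spatial stream alone favors class 0, while the fused scores favor class 1, which shows how the temporal stream can change the final decision. Top1 and Top5 figures like those reported above would be obtained by averaging `top_k_hit` with `k=1` and `k=5` over a labeled test set.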
