Video-based Sign Language Recognition without Temporal Segmentation

Millions of hearing-impaired people around the world routinely use some variant of sign language to communicate, so the automatic translation of sign language is meaningful and important. Currently, Sign Language Recognition (SLR) comprises two sub-problems: isolated SLR, which recognizes signs word by word, and continuous SLR, which translates entire sentences. Existing continuous SLR methods typically use isolated SLR as a building block, with an extra layer of preprocessing (temporal segmentation) and another layer of post-processing (sentence synthesis). Unfortunately, temporal segmentation itself is non-trivial and inevitably propagates errors into subsequent steps. Worse still, isolated SLR methods typically require strenuous labeling of each word separately in a sentence, severely limiting the amount of attainable training data. To address these challenges, we propose a novel continuous sign recognition framework, the Hierarchical Attention Network with Latent Space (LS-HAN), which eliminates the preprocessing of temporal segmentation. The proposed LS-HAN consists of three components: a two-stream Convolutional Neural Network (CNN) for video feature representation generation, a Latent Space (LS) for semantic gap bridging, and a Hierarchical Attention Network (HAN) for latent-space-based recognition. Experiments on two large-scale datasets demonstrate the effectiveness of the proposed framework.
