Fully Convolutional Networks for Continuous Sign Language Recognition

Continuous sign language recognition (SLR) is a challenging task that requires learning over both the spatial and temporal dimensions of signing frame sequences. Most recent work accomplishes this with hybrid CNN-RNN networks. However, training these networks is generally non-trivial, and most of them fail to learn unseen sequence patterns, which leads to unsatisfactory performance in online recognition. In this paper, we propose a fully convolutional network (FCN) for online SLR that concurrently learns spatial and temporal features from weakly annotated video sequences for which only sentence-level annotations are given. A gloss feature enhancement (GFE) module is introduced in the proposed network to enforce better sequence alignment learning. The proposed network is end-to-end trainable without any pre-training. We conduct experiments on two large-scale SLR datasets. The experiments show that our method is effective for continuous SLR and performs well in online recognition.
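The abstract does not spell out the architecture, so the following is only a minimal sketch of one general property that makes convolutional temporal models attractive for online recognition: each output depends on a fixed-size window of frames, so it can be emitted as soon as that window has arrived. The scalar per-frame features, the kernel values, and the function names `temporal_conv` and `streaming_conv` are all assumptions made for illustration and are not taken from the paper.

```python
def temporal_conv(frames, kernel):
    """Offline case: valid 1D cross-correlation over the whole sequence."""
    k = len(kernel)
    return [
        sum(kernel[j] * frames[t + j] for j in range(k))
        for t in range(len(frames) - k + 1)
    ]


def streaming_conv(frame_stream, kernel):
    """Online case: emit one output as soon as the last k frames are buffered."""
    k = len(kernel)
    window = []
    for frame in frame_stream:
        window.append(frame)
        if len(window) > k:
            window.pop(0)  # keep only the k most recent frames
        if len(window) == k:
            yield sum(w * x for w, x in zip(kernel, window))


# Hypothetical scalar frame features and kernel weights (illustrative only).
frames = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]
kernel = [0.5, 0.25, 0.25]

offline = temporal_conv(frames, kernel)
online = list(streaming_conv(iter(frames), kernel))
print(offline == online)
```

Under these assumptions the streaming pass reproduces the offline outputs without access to future frames beyond the current window. A bidirectional recurrent layer, by contrast, needs the full sequence before it can produce any output.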
