Continuous Sign Language Recognition Based on Pseudo-supervised Learning

Continuous sign language recognition is challenging because the ordered words have no exact temporal locations in the video. To address this problem, we propose a method based on pseudo-supervised learning. First, we use a 3D residual convolutional network (3D-ResNet) pre-trained on the UCF101 dataset to extract visual features. Second, we employ a sequence model with connectionist temporal classification (CTC) loss to learn the mapping between the visual features and sentence-level labels; this mapping can be used to generate clip-level pseudo-labels. Since the CTC objective function has limited effect on the visual features extracted by the early 3D-ResNet, we fine-tune the 3D-ResNet on the video clips and their clip-level pseudo-labels to obtain a better feature representation. The feature extractor and the sequence model are optimized alternately with CTC loss. The effectiveness of the proposed method is verified on the large-scale RWTH-PHOENIX-Weather-2014 dataset.
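
The abstract does not state how the clip-level pseudo-labels are derived from the CTC-trained sequence model. One common option, assumed here purely for illustration, is Viterbi forced alignment: given per-clip log-probabilities and the sentence-level label sequence, pick the most likely per-clip path that collapses to that sentence. The function name `ctc_forced_alignment`, the blank index 0, and the toy probabilities are hypothetical and not taken from the paper.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_forced_alignment(log_probs, labels, blank=BLANK):
    """Most likely per-clip label path that collapses to `labels`.

    log_probs: list of T rows, each a list of V log-probabilities.
    labels:    non-empty list of label ids (no blanks) for the sentence.
    """
    T = len(log_probs)
    # Extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)
    neg_inf = -math.inf
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Allowed predecessors: stay, advance by one, or skip a blank
            # between two different labels.
            cands = [s]
            if s >= 1:
                cands.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(s - 2)
            best = max(cands, key=lambda p: score[t - 1][p])
            if score[t - 1][best] == neg_inf:
                continue  # state not reachable at time t
            score[t][s] = score[t - 1][best] + log_probs[t][ext[s]]
            back[t][s] = best
    # A valid path ends in the last label or in the trailing blank.
    s = S - 1
    if score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    if score[T - 1][s] == neg_inf:
        raise ValueError("too few clips to align the label sequence")
    path = [blank] * T
    for t in range(T - 1, -1, -1):
        path[t] = ext[s]
        s = back[t][s]
    return path


# Toy example: 4 clips, vocabulary {0: blank, 1, 2}, sentence labels [1, 2].
probs = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.6, 0.2, 0.2],
    [0.1, 0.1, 0.8],
]
log_probs = [[math.log(p) for p in row] for row in probs]
pseudo_labels = ctc_forced_alignment(log_probs, [1, 2])
print(pseudo_labels)  # [1, 1, 0, 2]
```

In an alternating scheme like the one described above, per-clip labels of this kind could serve as classification targets when fine-tuning the feature extractor, after which the sequence model is retrained on the refreshed features. How blank-labelled clips are treated in that step is a design choice the abstract leaves open.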
