Skeleton-Aware Neural Sign Language Translation

Sign languages, an essential means of communication for deaf and hard-of-hearing people, are expressed through human actions. To distinguish these actions for sign language understanding, the skeleton, which encodes the position information of the human pose, provides an important cue, since different actions usually correspond to different poses/skeletons. However, the skeleton has not been fully studied for Sign Language Translation (SLT), especially for end-to-end SLT. In this paper, we therefore propose a novel end-to-end Skeleton-Aware neural Network (SANet) for video-based SLT. Specifically, to achieve end-to-end SLT, we design a self-contained branch for skeleton extraction. To efficiently guide feature extraction from video with skeletons, we concatenate the skeleton channel with the RGB channels of each frame. To distinguish the importance of clips, we construct a skeleton-based Graph Convolutional Network (GCN) for feature scaling, i.e., assigning an importance weight to each clip. The scaled features of each clip are then sent to a decoder module to generate spoken language. In SANet, a joint training strategy optimizes skeleton extraction and sign language translation together. Experimental results on two large-scale SLT datasets demonstrate the effectiveness of our approach, which outperforms state-of-the-art methods. Our code is available at https://github.com/SignLanguageCode/SANet.
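
The two skeleton-aware steps named in the abstract, channel concatenation and clip-level feature scaling, can be illustrated with a minimal NumPy sketch. This is not the SANet implementation: the function names, the tensor layouts, the generic normalized graph-convolution layer, and the sigmoid used to turn a graph response into a clip weight are all assumptions made for illustration only.

```python
import numpy as np


def fuse_skeleton_channel(rgb_frames, skeleton_maps):
    """Stack a one-channel skeleton map onto the RGB channels of each frame.

    rgb_frames:    (T, H, W, 3) array
    skeleton_maps: (T, H, W) array, e.g. a rendered joint map (assumed layout)
    returns:       (T, H, W, 4) array
    """
    if rgb_frames.shape[:3] != skeleton_maps.shape:
        raise ValueError("frame and skeleton map shapes do not match")
    return np.concatenate([rgb_frames, skeleton_maps[..., None]], axis=-1)


def gcn_layer(joint_feats, adjacency, weight):
    """One generic graph-convolution step: ReLU(A_norm @ X @ W).

    A_norm is the symmetrically normalized adjacency with self-loops.
    This is a textbook GCN layer, assumed here, not the paper's exact layer.
    """
    a_hat = adjacency + np.eye(adjacency.shape[0])
    deg_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]
    return np.maximum(a_norm @ joint_feats @ weight, 0.0)


def clip_importance(clip_joints, adjacency, weight):
    """Map each clip's joint features (N, J, C) to one weight in (0, 1).

    Averaging the GCN response and applying a sigmoid is a hypothetical
    choice of readout; any scalar readout could be substituted.
    """
    scores = np.array(
        [gcn_layer(joints, adjacency, weight).mean() for joints in clip_joints]
    )
    return 1.0 / (1.0 + np.exp(-scores))


def scale_clip_features(clip_feats, weights):
    """Scale each clip's feature vector (N, D) by its importance weight (N,)."""
    return clip_feats * weights[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.random((8, 32, 32, 3))      # 8 frames of a toy video
    skeletons = rng.random((8, 32, 32))      # matching skeleton maps
    fused = fuse_skeleton_channel(frames, skeletons)

    # A 3-joint chain graph (0-1-2) as a stand-in for a body skeleton.
    adjacency = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    clip_joints = rng.random((4, 3, 2))      # 4 clips, 3 joints, (x, y)
    gcn_weight = rng.normal(size=(2, 5))
    weights = clip_importance(clip_joints, adjacency, gcn_weight)

    clip_feats = rng.random((4, 16))         # 4 clip feature vectors
    scaled = scale_clip_features(clip_feats, weights)
    print(fused.shape, weights.shape, scaled.shape)
```

In this sketch the clip weights are computed independently of the video features and applied as a plain multiplicative scale. In an end-to-end model, both branches would presumably be trained jointly, as the abstract states.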
