Sign Language Translation with Hierarchical Spatio-Temporal Graph Neural Network

Sign language translation (SLT), which generates text in a spoken language from visual content in a sign language, is important to assist the hard-of-hearing community for their communications. Inspired by neural machine translation (NMT), most existing SLT studies adopted a general sequence to sequence learning strategy. However, SLT is significantly different from general NMT tasks since sign languages convey messages through multiple visual-manual aspects. Therefore, in this paper, these unique characteristics of sign languages are formulated as hierarchical spatio-temporal graph representations, including high-level and fine-level graphs of which a vertex characterizes a specified body part and an edge represents their interactions. Particularly, high-level graphs represent the patterns in the regions such as hands and face, and fine-level graphs consider the joints of hands and landmarks of facial regions. To learn these graph patterns, a novel deep learning architecture, namely hierarchical spatio-temporal graph neural network (HST-GNN), is proposed. Graph convolutions and graph self-attentions with neighborhood context are proposed to characterize both the local and the global graph properties. Experimental results on benchmark datasets demonstrated the effectiveness of the proposed method.

[1]  Kirsti Grobel,et al.  Isolated sign language recognition using hidden Markov models , 1996, 1997 IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation.

[2]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[3]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[4]  Horst Bischof,et al.  A Duality Based Approach for Realtime TV-L1 Optical Flow , 2007, DAGM-Symposium.

[5]  Ah Chung Tsoi,et al.  The Graph Neural Network Model , 2009, IEEE Transactions on Neural Networks.

[6]  Luc Van Gool,et al.  Real-time sign language letter and word recognition from depth data , 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[7]  Nicolas Pugeault,et al.  Sign language recognition using sub-units , 2012, J. Mach. Learn. Res..

[8]  Joan Bruna,et al.  Spectral Networks and Locally Connected Networks on Graphs , 2013, ICLR.

[9]  Hermann Ney,et al.  Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers , 2015, Comput. Vis. Image Underst..

[10]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[11]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Hermann Ney,et al.  Deep Hand: How to Train a CNN on 1 Million Hand Images When Your Data is Continuous and Weakly Labelled , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Xilin Chen,et al.  Continuous Gesture Recognition with Hand-Oriented Spatiotemporal Feature , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[14]  Oscar Koller,et al.  SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Pietro Liò,et al.  Graph Attention Networks , 2017, ICLR.

[16]  Hermann Ney,et al.  Deep Sign: Enabling Robust Statistical Continuous Sign Language Recognition via Hybrid CNN-HMMs , 2018, International Journal of Computer Vision.

[17]  Hermann Ney,et al.  Neural Sign Language Translation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18]  Jie Huang,et al.  Video-based Sign Language Recognition without Temporal Segmentation , 2018, AAAI.

[19]  Meng Wang,et al.  Hierarchical LSTM for Sign Language Translation , 2018, AAAI.

[20]  Houqiang Li,et al.  Iterative Alignment Network for Continuous Sign Language Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Sang-Ki Ko,et al.  Neural Sign Language Translation based on Human Keypoint Estimation , 2018, Applied Sciences.

[22]  Philip S. Yu,et al.  A Comprehensive Survey on Graph Neural Networks , 2019, IEEE Transactions on Neural Networks and Learning Systems.

[23]  H. Kashima,et al.  Approximation Ratios of Graph Neural Networks for Combinatorial Problems , 2019, NeurIPS.

[24]  Shuo Wang,et al.  Dense Temporal Convolution Network for Sign Language Translation , 2019, IJCAI.

[25]  Houqiang Li,et al.  Dynamic Pseudo Label Decoding for Continuous Sign Language Recognition , 2019, 2019 IEEE International Conference on Multimedia and Expo (ICME).

[26]  Juan Song,et al.  Continuous Gesture Segmentation and Recognition Using 3DCNN and Convolutional LSTM , 2019, IEEE Transactions on Multimedia.

[27]  Changshui Zhang,et al.  A Deep Neural Framework for Continuous Sign Language Recognition by Iterative Training , 2019, IEEE Transactions on Multimedia.

[28]  Oscar Koller,et al.  Sign Language Transformers: Joint End-to-End Sign Language Recognition and Translation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jesse Read,et al.  Better Sign Language Translation with STMC-Transformer , 2020, COLING.

[30]  Stephan Günnemann,et al.  Directional Message Passing for Molecular Graphs , 2020, ICLR.

[31]  Hongdong Li,et al.  Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[32]  Hermann Ney,et al.  Weakly Supervised Learning with Multi-Stream CNN-LSTM-HMMs to Discover Sequential Parallelism in Sign Language Videos , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Houqiang Li,et al.  Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition , 2020, AAAI.

[34]  Michèle Gouiffès,et al.  Automatic Segmentation of Sign Language into Subtitle-Units , 2020, ECCV Workshops.

[35]  Kayo Yin,et al.  Sign Language Translation with Transformers , 2020, ArXiv.

[36]  Liang Wang,et al.  Graph Sequence Recurrent Neural Network for Vision-Based Freezing of Gait Detection , 2020, IEEE Transactions on Image Processing.

[37]  Houqiang Li,et al.  Boosting Continuous Sign Language Recognition via Cross Modality Augmentation , 2020, ACM Multimedia.

[38]  Mitesh M. Khapra,et al.  INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition , 2020, ACM Multimedia.

[39]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Parth H. Pathak,et al.  Hand Pose Guided 3D Pooling for Word-level Sign Language Recognition , 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[41]  Yang Zhao,et al.  Deep High-Resolution Representation Learning for Visual Recognition , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  A. Tsoi,et al.  Graph Fusion Network-Based Multimodal Learning for Freezing of Gait Detection , 2021, IEEE Transactions on Neural Networks and Learning Systems.