Paper Information - Sign Language Translation with Hierarchical Spatio-Temporal Graph Neural Network

Sign language translation (SLT), which generates text in a spoken language from visual content in a sign language, is important for helping the hard-of-hearing community communicate. Inspired by neural machine translation (NMT), most existing SLT studies adopt a general sequence-to-sequence learning strategy. However, SLT differs significantly from general NMT tasks because sign languages convey messages through multiple visual-manual aspects. In this paper, these unique characteristics of sign languages are therefore formulated as hierarchical spatio-temporal graph representations, comprising high-level and fine-level graphs in which a vertex characterizes a specified body part and an edge represents the interaction between two parts. Specifically, high-level graphs represent patterns in regions such as the hands and face, while fine-level graphs consider the joints of the hands and the landmarks of facial regions. To learn these graph patterns, a novel deep learning architecture, the hierarchical spatio-temporal graph neural network (HST-GNN), is proposed. Graph convolutions and graph self-attentions with neighborhood context are proposed to characterize both local and global graph properties. Experimental results on benchmark datasets demonstrate the effectiveness of the proposed method.
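To make the two graph levels concrete, the following is a minimal sketch of a single-frame, two-level body graph with one generic GCN-style graph convolution applied to each level. It is not the HST-GNN architecture from the paper: the vertex names, the edge lists, the toy features, the identity weight matrix, and the symmetric normalization are all assumptions chosen for illustration, and the temporal dimension and self-attention are omitted.

```python
import math

# Hypothetical high-level graph: vertices are body regions (assumed names),
# edges are assumed interactions between regions.
HIGH_NODES = ["left_hand", "right_hand", "face", "body"]
HIGH_EDGES = [(0, 1), (0, 3), (1, 3), (2, 3)]

# Hypothetical fine-level graph for one hand: a wrist joined to three fingertips.
FINE_NODES = ["wrist", "thumb_tip", "index_tip", "middle_tip"]
FINE_EDGES = [(0, 1), (0, 2), (0, 3)]


def adjacency(num_nodes, edges):
    """Build a symmetric adjacency matrix with self-loops."""
    a = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
         for i in range(num_nodes)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    return a


def normalize(a):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (a common GCN choice)."""
    deg = [sum(row) for row in a]
    n = len(a)
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain nested-list matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def graph_conv(a_norm, feats, weight):
    """One GCN-style layer: ReLU(A_norm @ X @ W)."""
    h = matmul(matmul(a_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in h]


# Toy 2-D features per vertex (for example, a mean keypoint position); made up.
high_feats = [[0.2, 0.8], [0.7, 0.8], [0.5, 0.1], [0.5, 0.5]]
fine_feats = [[0.20, 0.80], [0.15, 0.70], [0.20, 0.65], [0.25, 0.66]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix

high_norm = normalize(adjacency(len(HIGH_NODES), HIGH_EDGES))
fine_norm = normalize(adjacency(len(FINE_NODES), FINE_EDGES))

high_out = graph_conv(high_norm, high_feats, identity)
fine_out = graph_conv(fine_norm, fine_feats, identity)

for name, row in zip(HIGH_NODES, high_out):
    print(name, [round(v, 3) for v in row])
```

In this sketch each vertex's output mixes its own feature with those of its graph neighbors, so the `face` vertex is influenced by `body` but not directly by either hand. A spatio-temporal variant would additionally connect the same vertex across consecutive frames.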
