Skeletal Graph Self-Attention: Embedding a Skeleton Inductive Bias into Sign Language Production

Recent approaches to Sign Language Production (SLP) have adopted spoken language Neural Machine Translation (NMT) architectures, applied without sign-specific modifications. In addition, these works represent sign language as a sequence of skeleton pose vectors, projected to an abstract representation with no inherent skeletal structure. In this paper, we represent sign language sequences as a skeletal graph structure, with joints as nodes and both spatial and temporal connections as edges. To operate on this graphical structure, we propose Skeletal Graph Self-Attention (SGSA), a novel graphical attention layer that embeds a skeleton inductive bias into the SLP model. Retaining the skeletal feature representation throughout, we directly incorporate a spatio-temporal adjacency matrix into the self-attention formulation. This provides each skeletal joint with structure and context that are not available when using a non-graphical abstract representation, enabling fluid and expressive sign language production. We evaluate our Skeletal Graph Self-Attention architecture on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, achieving state-of-the-art back translation performance with improvements of 8% and 7% over competing methods on the dev and test sets, respectively.
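As a rough illustration of the general idea of combining a spatio-temporal adjacency matrix with self-attention, the sketch below masks single-head attention scores so that each joint attends only to itself, its spatial neighbours in the same frame, and the same joint in adjacent frames. This is not the SGSA formulation itself, which the abstract does not specify: the masking strategy, the toy three-joint skeleton, and the names `spatio_temporal_adjacency` and `graph_self_attention` are all assumptions made for illustration.

```python
import numpy as np


def spatio_temporal_adjacency(num_joints, bones, num_frames):
    """Adjacency over (frame, joint) nodes, flattened as frame * num_joints + joint.

    Assumed edge types: self-loops, spatial bones within a frame, and temporal
    links between the same joint in consecutive frames.
    """
    n = num_joints * num_frames
    adj = np.eye(n)  # self-loops keep every attention row non-empty
    for t in range(num_frames):
        offset = t * num_joints
        for i, j in bones:  # spatial edges (undirected)
            adj[offset + i, offset + j] = 1.0
            adj[offset + j, offset + i] = 1.0
        if t + 1 < num_frames:  # temporal edges to the next frame
            for j in range(num_joints):
                a, b = offset + j, offset + num_joints + j
                adj[a, b] = 1.0
                adj[b, a] = 1.0
    return adj


def graph_self_attention(x, adj, w_q, w_k, w_v):
    """Single-head scaled dot-product attention restricted to graph edges."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Assumption: non-adjacent pairs are masked out before the softmax.
    scores = np.where(adj > 0, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical toy skeleton: 3 joints in a chain, 2 frames, 4 features per node.
    adj = spatio_temporal_adjacency(num_joints=3, bones=[(0, 1), (1, 2)], num_frames=2)
    x = rng.normal(size=(6, 4))
    w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
    out, weights = graph_self_attention(x, adj, w_q, w_k, w_v)
    print(out.shape, weights.shape)
```

Because the mask is applied before the softmax, each row of the attention weights still sums to one over the permitted neighbours. Other ways of injecting the adjacency (for example, as an additive bias or a multiplicative weighting) would be equally plausible readings of the abstract.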
