Paper Information - Skeletal Graph Self-Attention: Embedding a Skeleton Inductive Bias into Sign Language Production


Recent approaches to Sign Language Production (SLP) have adopted spoken language Neural Machine Translation (NMT) architectures, applied without sign-specific modifications. In addition, these works represent sign language as a sequence of skeleton pose vectors, projected to an abstract representation with no inherent skeletal structure. In this paper, we represent sign language sequences as a skeletal graph structure, with joints as nodes and both spatial and temporal connections as edges. To operate on this graphical structure, we propose Skeletal Graph Self-Attention (SGSA), a novel graphical attention layer that embeds a skeleton inductive bias into the SLP model. Retaining the skeletal feature representation throughout, we incorporate a spatio-temporal adjacency matrix directly into the self-attention formulation. This provides each skeletal joint with structure and context that are not available in a non-graphical abstract representation, enabling fluid and expressive sign language production. We evaluate our Skeletal Graph Self-Attention architecture on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, achieving state-of-the-art back translation performance, with improvements of 8% and 7% over competing methods on the dev and test sets, respectively.
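
To make the idea of a spatio-temporal adjacency matrix inside self-attention more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say how the adjacency matrix enters the attention computation, so this sketch assumes the simplest option, a hard mask that restricts each joint to attend only to its graph neighbours. The function names `build_spatio_temporal_adjacency` and `masked_attention_weights`, the `temporal_window` parameter, and the toy three-joint skeleton are all hypothetical.

```python
import math


def build_spatio_temporal_adjacency(num_joints, bones, num_frames, temporal_window=1):
    """Build a 0/1 adjacency matrix over (frame, joint) nodes.

    Assumed layout: node index = frame * num_joints + joint.
    Spatial edges link joints joined by a bone within one frame;
    temporal edges link the same joint across nearby frames.
    """
    n = num_frames * num_joints
    adj = [[0] * n for _ in range(n)]

    def node(t, j):
        return t * num_joints + j

    for t in range(num_frames):
        # Self-loops, so every node can attend to itself.
        for j in range(num_joints):
            adj[node(t, j)][node(t, j)] = 1
        # Spatial edges (undirected) inside frame t.
        for a, b in bones:
            adj[node(t, a)][node(t, b)] = 1
            adj[node(t, b)][node(t, a)] = 1
        # Temporal edges to the same joint in later frames within the window.
        for dt in range(1, temporal_window + 1):
            if t + dt < num_frames:
                for j in range(num_joints):
                    adj[node(t, j)][node(t + dt, j)] = 1
                    adj[node(t + dt, j)][node(t, j)] = 1
    return adj


def masked_attention_weights(scores, adj):
    """Row-wise softmax over raw attention scores, restricted to graph neighbours.

    Entries where adj == 0 receive exactly zero weight (an assumed hard mask).
    """
    weights = []
    for row_scores, row_adj in zip(scores, adj):
        allowed = [s for s, a in zip(row_scores, row_adj) if a]
        m = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - m) if a else 0.0 for s, a in zip(row_scores, row_adj)]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights


# Toy example: a 3-joint chain (0-1-2) over 2 frames, with uniform raw scores.
adj = build_spatio_temporal_adjacency(num_joints=3, bones=[(0, 1), (1, 2)], num_frames=2)
scores = [[0.0] * len(adj) for _ in adj]
weights = masked_attention_weights(scores, adj)
```

In this toy example, joint 0 in frame 0 attends only to itself, to joint 1 in the same frame, and to joint 0 in the next frame. A soft variant that adds a learned bias per edge type instead of masking would be an equally plausible reading of the abstract.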
