Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse

Sign Language Translation (SLT) is a challenging task that aims to generate spoken language sentences from sign language videos, two modalities that differ in grammar and word/gloss order. From a Neural Machine Translation (NMT) perspective, the straightforward way to train translation models is on pairs of sign language phrases and spoken language sentences. However, human interpreters rely heavily on context to understand the conveyed information, especially in sign language interpretation, where the sign vocabulary may be significantly smaller than that of the spoken language. Taking direct inspiration from how humans translate, we propose a novel multi-modal transformer architecture that tackles the translation task in a context-aware manner. We use context from preceding sequences and confident predictions to disambiguate weaker visual cues. To achieve this we use complementary transformer encoders, namely: (1) a Video Encoder, which captures low-level video features at the frame level; (2) a Spotting Encoder, which models the sign glosses recognized in the video; and (3) a Context Encoder, which captures the context of the preceding sign sequences. We combine the information from these encoders in a final transformer decoder to generate spoken language translations. We evaluate our approach on the recently published large-scale BOBSL dataset, which contains ~1.2M sequences, and on the SRF dataset, which was part of the WMT-SLT 2022 challenge. Using contextual information, we report significant improvements over state-of-the-art translation performance, nearly doubling the BLEU-4 scores of baseline approaches.
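To make the multi-encoder design concrete, below is a minimal PyTorch sketch of the architecture the abstract describes: three separate transformer encoders whose outputs a single decoder attends over. All module names, dimensions, and the concatenation-based fusion are illustrative assumptions; the paper's actual feature extractors and fusion scheme are not specified here.

```python
# Hypothetical sketch of a context-aware multi-encoder SLT model.
# Shapes, hyperparameters, and the fusion-by-concatenation choice are
# assumptions for illustration, not the authors' exact implementation.
import torch
import torch.nn as nn

class ContextAwareSLT(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=2, vocab_size=25000):
        super().__init__()
        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers)
        self.video_encoder = make_encoder()     # (1) frame-level visual features
        self.spotting_encoder = make_encoder()  # (2) embeddings of spotted glosses
        self.context_encoder = make_encoder()   # (3) embeddings of preceding sequences
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, gloss_emb, context_emb, tgt_emb, tgt_mask=None):
        # Encode each input stream separately, then let the decoder
        # cross-attend to the concatenated memories (one plausible fusion).
        memory = torch.cat([
            self.video_encoder(video_feats),
            self.spotting_encoder(gloss_emb),
            self.context_encoder(context_emb),
        ], dim=1)
        h = self.decoder(tgt_emb, memory, tgt_mask=tgt_mask)
        return self.out(h)  # logits over the spoken-language vocabulary
```

All inputs are assumed to be pre-projected to `(batch, time, d_model)`; at inference the context stream would be filled with encodings of previously translated sentences, mirroring how the abstract describes using prior context and confident predictions to disambiguate weak visual cues.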
