Gloss Alignment using Word Embeddings

Capturing and annotating sign language datasets is a time-consuming and costly process, and current datasets are orders of magnitude too small to successfully train unconstrained Sign Language Translation (SLT) models. As a result, research has turned to TV broadcast content as a source of large-scale training data, consisting of both the sign language interpreter and the associated audio subtitle. However, the lack of sign language annotation limits the usability of this data and has led to the development of automatic annotation techniques such as sign spotting. These spottings are aligned to the video rather than the subtitle, which often results in a misalignment between the subtitle and the spotted signs. In this paper we propose a method for aligning spottings with their corresponding subtitles using large spoken language models. Because it operates on a single modality, our method is computationally inexpensive and can be used in conjunction with existing alignment techniques. We quantitatively demonstrate its effectiveness on the Meine DGS-Annotated (MeineDGS) and BBC-Oxford British Sign Language (BOBSL) datasets, recovering up to a 33.22 BLEU-1 score in word alignment.
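The embedding-based alignment idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy 2-D vectors and the simple nearest-neighbour cosine-similarity rule are assumptions, whereas the paper derives its embeddings from large pretrained spoken language models.

```python
import numpy as np

def align_glosses(gloss_vecs, subtitle_vecs):
    """Assign each spotted gloss to the subtitle word whose embedding
    has the highest cosine similarity (hypothetical simplification)."""
    def l2_normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    # Rows: glosses, columns: subtitle words; entries are cosine similarities.
    sim = l2_normalize(gloss_vecs) @ l2_normalize(subtitle_vecs).T
    return sim.argmax(axis=1)

# Toy 2-D embeddings standing in for real word vectors.
glosses = np.array([[1.0, 0.0], [0.0, 1.0]])
subtitle = np.array([[0.0, 1.0], [0.9, 0.1]])
print(align_glosses(glosses, subtitle))  # gloss 0 -> word 1, gloss 1 -> word 0
```

Because the similarity matrix is computed purely from text embeddings, such a step adds negligible cost on top of any video-based alignment pipeline.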
