BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

Recent progress in fine-grained gesture and action classification, and in machine translation, points to the possibility of automated sign language recognition becoming a reality. A key stumbling block on the path to this goal is the lack of appropriate training data, stemming from the high complexity of sign annotation and a limited supply of qualified annotators. In this work, we introduce a new scalable approach to data collection for sign recognition in continuous videos. We use weakly-aligned subtitles for broadcast footage together with a keyword spotting method to automatically localise sign instances for a vocabulary of 1,000 signs in 1,000 hours of video. We make the following contributions: (1) we show how to use mouthing cues from signers to obtain high-quality annotations from video data; the result is the BSL-1K dataset, a collection of British Sign Language (BSL) signs of unprecedented scale. (2) We show that BSL-1K can be used to train strong sign recognition models for co-articulated signs in BSL, and that these models additionally serve as excellent pretraining for other sign languages and benchmarks: we exceed the state of the art on both the MSASL and WLASL benchmarks. Finally, (3) we propose new large-scale evaluation sets for the tasks of sign recognition and sign spotting, and provide baselines which we hope will stimulate research in this area.
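The localisation idea in the abstract can be sketched as follows: a weakly-aligned subtitle supplies only a coarse temporal window in which a keyword may be mouthed; a visual keyword spotter then supplies per-frame confidences inside that window, and a sign instance is accepted at the confidence peak if it clears a threshold. This is a minimal illustrative sketch, not the authors' implementation; the function name, the threshold value, and the toy probabilities are all assumptions.

```python
def localise_sign(keyword_probs, subtitle_window, threshold=0.8):
    """Return the frame index of a localised sign instance, or None.

    keyword_probs   -- per-frame probabilities that the target keyword is
                       being mouthed (e.g. from a visual keyword spotter);
                       illustrative only
    subtitle_window -- (start_frame, end_frame) coarse window derived from
                       the weakly-aligned subtitle
    threshold       -- hypothetical confidence cut-off for keeping a candidate
    """
    start, end = subtitle_window
    window = keyword_probs[start:end]
    if not window:
        return None
    # Find the peak of the spotter's confidence inside the subtitle window.
    peak_offset = max(range(len(window)), key=lambda i: window[i])
    if window[peak_offset] < threshold:
        return None  # spotter not confident enough: discard this candidate
    return start + peak_offset  # frame of the localised sign instance

# Toy usage: the spotter fires confidently at frame 12 inside the window.
probs = [0.1] * 10 + [0.2, 0.5, 0.95, 0.4] + [0.1] * 10
print(localise_sign(probs, (8, 20)))  # -> 12
```

In the paper's setting, annotations accepted this way become training labels for a sign recognition model, so a conservative threshold trades annotation quantity for label quality.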
