BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

Recent progress in fine-grained gesture and action classification, and in machine translation, points to the possibility of automated sign language recognition becoming a reality. A key stumbling block on the path to this goal is the lack of appropriate training data, stemming from the high complexity of sign annotation and a limited supply of qualified annotators. In this work, we introduce a new scalable approach to data collection for sign recognition in continuous videos. We use weakly-aligned subtitles for broadcast footage together with a keyword spotting method to automatically localise sign instances for a vocabulary of 1,000 signs in 1,000 hours of video. We make the following contributions: (1) we show how to use mouthing cues from signers to obtain high-quality annotations from video data; the result is the BSL-1K dataset, a collection of British Sign Language (BSL) signs of unprecedented scale. (2) We show that BSL-1K can be used to train strong sign recognition models for co-articulated signs in BSL, and that these models additionally serve as excellent pretraining for other sign languages and benchmarks: we exceed the state of the art on both the MSASL and WLASL benchmarks. Finally, (3) we propose new large-scale evaluation sets for the tasks of sign recognition and sign spotting, and provide baselines which we hope will stimulate research in this area.
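The localisation idea in the abstract can be sketched as follows: a weakly-aligned subtitle supplies only a coarse temporal window in which a keyword may be mouthed; a visual keyword spotter then supplies per-frame confidences inside that window, and a sign instance is accepted at the confidence peak if it clears a threshold. This is a minimal illustrative sketch, not the authors' implementation; the function name, the threshold value, and the toy probabilities are all assumptions.

```python
def localise_sign(keyword_probs, subtitle_window, threshold=0.8):
    """Return the frame index of a localised sign instance, or None.

    keyword_probs   -- per-frame probabilities that the target keyword is
                       being mouthed (e.g. from a visual keyword spotter);
                       illustrative only
    subtitle_window -- (start_frame, end_frame) coarse window derived from
                       the weakly-aligned subtitle
    threshold       -- hypothetical confidence cut-off for keeping a candidate
    """
    start, end = subtitle_window
    window = keyword_probs[start:end]
    if not window:
        return None
    # Find the peak of the spotter's confidence inside the subtitle window.
    peak_offset = max(range(len(window)), key=lambda i: window[i])
    if window[peak_offset] < threshold:
        return None  # spotter not confident enough: discard this candidate
    return start + peak_offset  # frame of the localised sign instance

# Toy usage: the spotter fires confidently at frame 12 inside the window.
probs = [0.1] * 10 + [0.2, 0.5, 0.95, 0.4] + [0.1] * 10
print(localise_sign(probs, (8, 20)))  # -> 12
```

In the paper's setting, annotations accepted this way become training labels for a sign recognition model, so a conservative threshold trades annotation quantity for label quality.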
