Weakly Supervised Automatic Transcription of Mouthings for Gloss-Based Sign Language Corpora

In this work we propose a method to automatically annotate mouthings in sign language corpora, requiring no more than a simple gloss annotation and a source of weak supervision, such as automatic speech transcripts. For a long time, research on automatic sign language recognition has focused on the manual components. However, a full understanding of sign language is not possible without exploring its remaining parameters. Mouthings provide important information for disambiguating signs that are manually identical. Nevertheless, most corpora intended for pattern recognition purposes lack any mouthing annotations. To our knowledge, no previous work exists that automatically annotates mouthings in the context of sign language. Our method achieves a frame error rate of 39% for a single signer on the alignment task.
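For context, a frame error rate on an alignment task is commonly computed as the fraction of video frames whose hypothesized label disagrees with the reference. The sketch below is a minimal illustration under that assumption; it is not taken from the paper, and the label names are purely hypothetical.

```python
# Minimal sketch (assumption, not the paper's code): frame error rate (FER)
# for a frame-level alignment, given one reference and one hypothesized
# mouthing/viseme label per video frame.

def frame_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Return the fraction of frames whose hypothesized label differs from the reference."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same number of frames")
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)

# Hypothetical example: 2 of 5 frames are mislabeled -> FER = 40%
ref = ["sil", "a", "a", "b", "sil"]
hyp = ["sil", "a", "b", "b", "b"]
print(f"FER = {frame_error_rate(ref, hyp):.0%}")
```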
