Read My Lips: Continuous Signer Independent Weakly Supervised Viseme Recognition

This work presents a framework for signer-independent recognition of mouthings in continuous sign language that requires no manual annotations. Mouthings are lip movements that correspond to the pronunciation of words, or parts of words, during signing. Research on sign language recognition has focused extensively on the hands as features, but sign language is multi-modal: a full understanding, particularly with respect to its lexical variety, language idioms, and grammatical structures, is not possible without exploring the remaining information channels. To our knowledge, no previous work has explored dedicated viseme recognition in the context of sign language recognition. The approach is trained on over 180,000 unlabelled frames and reaches 47.1% precision at the frame level. Generalisation across individuals and the influence of context-dependent visemes are analysed.
