Automatic Viseme Vocabulary Construction to Enhance Continuous Lip-reading

Speech is the most common communication method between humans and involves the perception of both auditory and visual channels. Automatic speech recognition focuses on interpreting the audio signal, but it has been demonstrated that video can provide information that is complementary to the audio. Thus, the study of automatic lip-reading is important and remains an open problem. One of the key challenges is the definition of the visual elementary units (the visemes) and of their vocabulary. Many researchers have analyzed the importance of the phoneme-to-viseme mapping and have proposed viseme vocabularies of between 11 and 15 visemes. These vocabularies have usually been defined manually from linguistic properties, and in some cases using decision trees or clustering techniques. In this work, we focus on the automatic construction of an optimal viseme vocabulary based on the association of phonemes with similar visual appearance. To this end, we construct an automatic system that uses local appearance descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences. To compare the performance of the system, different descriptors (PCA, DCT and SIFT) are analyzed. We test our system on a Spanish corpus of continuous speech. Our results indicate that we are able to recognize approximately 58% of the visemes, 47% of the phonemes and 23% of the words in a continuous-speech scenario, and that the optimal viseme vocabulary for Spanish is composed of 20 visemes.
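The abstract does not spell out how phonemes with similar appearance are grouped into visemes. One plausible sketch of such an automatic construction (the function name, the greedy merging criterion, and the toy confusion matrix below are all invented for illustration, not taken from the paper) is agglomerative merging: starting from one class per phoneme, repeatedly fuse the pair of classes that a visual recognizer confuses most often, until the desired vocabulary size is reached.

```python
import numpy as np

def build_viseme_vocabulary(confusion, phonemes, target_size):
    """Greedily merge the two most-confused phoneme classes until
    `target_size` clusters remain; each cluster is a candidate viseme.

    confusion[i, j] counts how often phoneme i was recognized as j
    by some visual-only recognizer (hypothetical input data)."""
    clusters = [{p} for p in phonemes]
    C = confusion.astype(float).copy()
    while len(clusters) > target_size:
        # Symmetrize so that confusion in either direction counts,
        # and ignore the diagonal (correct recognitions).
        S = C + C.T
        np.fill_diagonal(S, -np.inf)
        i, j = np.unravel_index(np.argmax(S), S.shape)
        if i > j:
            i, j = j, i
        # Merge cluster j into cluster i and collapse its counts.
        clusters[i] |= clusters[j]
        C[i, :] += C[j, :]
        C[:, i] += C[:, j]
        C = np.delete(np.delete(C, j, axis=0), j, axis=1)
        del clusters[j]
    return clusters

# Toy example: /p/ and /b/ look alike on the lips, so they are
# heavily confused and should be merged into one viseme.
phonemes = ["p", "b", "f", "a"]
confusion = np.array([
    [50, 40,  5,  5],
    [38, 52,  6,  4],
    [ 4,  6, 80, 10],
    [ 2,  3,  9, 86],
])
print(build_viseme_vocabulary(confusion, phonemes, target_size=3))
```

In this sketch the stopping point is a fixed vocabulary size; in the paper the optimal size (20 visemes for Spanish) is instead determined by recognition performance, so a practical version would sweep `target_size` and keep the vocabulary that maximizes viseme accuracy.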
