Automatic Visual Speech Recognition

Lip reading was for many years thought to be specific to hearing-impaired persons, and was therefore regarded as a compensatory skill for an abnormal situation rather than a normal part of communication. Even the name of the domain suggests that lip reading was considered a rather artificial way of communicating, since it associates the skill with reading written language, a relatively recent cultural phenomenon rather than an evolutionarily inherited ability. Early lip reading research was consequently carried out mainly to improve teaching methodology for hearing-impaired persons and to increase their chances of integration into society.

Later research on human perception, and more precisely on speech perception, showed that lip reading is actively employed, to different degrees, by all humans irrespective of their hearing capacity. The best-known study in this respect was performed by Harry McGurk and John MacDonald in 1976. In their experiment, the two researchers were investigating how children perceive speech. Their finding, now called the McGurk effect and published in Nature (McGurk & MacDonald, 1976), was that when a person is shown a video of one utterance (in their experiments the syllable 'ga') while the audio track carries a different utterance (the syllable 'ba'), in a large majority of cases the person perceives a third utterance (in this case 'da'). Subsequent experiments showed that the effect also holds for longer utterances and that it is not a particularity of the visual and auditory senses, but applies to other perceptual functions as well.

Lip reading is therefore part of our multi-sensory speech perception process and could be more accurately called visual speech recognition. Because it is an evolutionarily acquired capacity, like speech perception itself, some scientists consider the neural mechanism underlying lip reading to be the one that enables humans to achieve high literacy skills with relative ease (van Atteveldt, 2006).

[1] Gerasimos Potamianos et al., An image transform approach for HMM based automatic lipreading, Proceedings of the 1998 International Conference on Image Processing (ICIP), 1998.

[2] Sadaoki Furui et al., Audio-visual speech recognition using lip movement extracted from side-face images, AVSP, 2003.

[3] Jiri Matas et al., XM2VTSDB: The Extended M2VTS Database, 1999.

[4] Roland Göcke et al., The audio-video Australian English speech data corpus AVOZES, INTERSPEECH, 2012.

[5] Sadaoki Furui et al., Multi-Modal Speech Recognition Using Optical-Flow Analysis for Lip Images, J. VLSI Signal Process., 2004.

[6] A. Caplier et al., Automatic and Accurate Lip Tracking, 2003.

[7] Juergen Luettin et al., Audio-Visual Automatic Speech Recognition: An Overview, 2004.

[8] Kevin P. Murphy et al., Dynamic Bayesian Networks for Audio-Visual Speech Recognition, EURASIP J. Adv. Signal Process., 2002.

[9] Matti Pietikäinen et al., Local spatiotemporal descriptors for visual recognition of spoken phrases, HCM '07, 2007.

[10] E. Petajan et al., An improved automatic lipreading system to enhance speech recognition, CHI '88, 1988.

[11] Mubarak Shah et al., Visually Recognizing Speech Using Eigensequences, 1997.

[12] Mark A. Clements et al., Automatic Speechreading with Applications to Human-Computer Interfaces, EURASIP J. Adv. Signal Process., 2002.

[13] Ming Liu et al., AVICAR: audio-visual speech corpus in a car environment, INTERSPEECH, 2004.

[14] Jacek C. Wojdel et al., Automatic Lipreading in the Dutch Language, 2003.

[15] Thorsten Gernoth et al., Local binary patterns for lip motion analysis, 15th IEEE International Conference on Image Processing, 2008.

[16] Anton Nijholt et al., Classifying Visemes for Automatic Lipreading, TSD, 1999.

[17] Lou Boves et al., Creation and analysis of the Dutch Polyphone corpus, ICSLP, 1994.

[18] Patrice Delmas et al., Automatic lip tracking: Bayesian segmentation and active contours in a cooperative scheme, Proceedings of the IEEE International Conference on Multimedia Computing and Systems, 1999.

[19] Juergen Luettin et al., Audio-Visual Speech Modeling for Continuous Speech Recognition, IEEE Trans. Multimedia, 2000.

[20] Paul A. Viola et al., Robust Real-time Object Detection, 2001.

[21] Flavio Prieto et al., Automatic Quantitative Mouth Shape Analysis, CAIP, 2007.

[22] Sadaoki Furui et al., Robust methods in automatic speech recognition and understanding, INTERSPEECH, 2003.

[23] Alexander H. Waibel et al., See Me, Hear Me: Integrating Automatic Speech Recognition and Lip-reading, 1994.

[24] Alexander Zelinsky et al., Automatic Extraction of Lip Feature Points, 2000.

[25] Tsuhan Chen et al., Profile View Lip Reading, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.

[26] David J. Fleet et al., Design and Use of Linear Models for Image Motion Analysis, International Journal of Computer Vision, 2000.

[27] Joseph Picone et al., Support vector machines for speech recognition, ICSLP, 1998.

[28] Alexander Zelinsky et al., Validation of an automatic lip-tracking algorithm and design of a database for audio-video speech processing, 2000.

[29] Rong Chen et al., A PCA Based Visual DCT Feature Extraction Method for Lip-Reading, International Conference on Intelligent Information Hiding and Multimedia, 2006.

[30] L.W.J. Boves et al., Use of the Dutch POLYPHONE corpus for application development, Proceedings of the 2nd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, 1994.

[31] Jean-Philippe Thiran et al., Mutual information eigenlips for audio-visual speech recognition, 14th European Signal Processing Conference, 2006.

[32] Stephen J. Cox et al., Audiovisual speech recognition using multiscale nonlinear image decomposition, Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP), 1996.

[33] K. Munhall et al., Spatial statistics of gaze fixations during dynamic face processing, Social Neuroscience, 2007.

[34] Guido F. Smoorenburg et al., Viseme classifications of Dutch consonants and vowels, 1994.

[35] Luc Vandendorpe et al., The M2VTS Multimodal Face Database (Release 1.00), AVBPA, 1997.

[36] Alejandro F. Frangi et al., Lip reading for robust speech recognition on embedded devices, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005.

[37] H. McGurk et al., Hearing lips and seeing voices, Nature, 1976.

[38] Paul Deléglise et al., The LIUM-AVS database: a corpus to test lip segmentation and speechreading systems in natural conditions, INTERSPEECH, 2003.

[39] J. N. Gowdy et al., CUAVE: A new audio-visual database for multimodal human-computer interface research, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2002.

[40] Juergen Luettin et al., Statistical lip modelling for visual speech recognition, 8th European Signal Processing Conference (EUSIPCO), 1996.

[41] Javier R. Movellan et al., Visual Speech Recognition with Stochastic Networks, NIPS, 1994.

[42] Audio-Visual Speech Recognition Using New Lip Features Extracted from Side-Face Images, 2004.

[43] Timothy F. Cootes et al., Active Appearance Models, ECCV, 1998.

[44] Léon J. M. Rothkrantz et al., Comparison between different feature extraction techniques for audio-visual speech recognition, Journal on Multimodal User Interfaces, 2007.

[45] Sadaoki Furui et al., A Robust Multimodal Speech Recognition Method using Optical Flow Analysis, 2005.

[46] Raúl Pinto-Elías et al., Lips Shape Extraction Via Active Shape Model and Local Binary Pattern, MICAI, 2007.

[47] Johan A. du Preez et al., Audio-Visual Speech Recognition using SciPy, 2010.

[48] Alex Pentland et al., Automatic lipreading by optical-flow analysis, 1989.

[49] Timothy F. Cootes et al., Extraction of Visual Features for Lipreading, IEEE Trans. Pattern Anal. Mach. Intell., 2002.

[50] Gerasimos Potamianos et al., Lipreading Using Profile Versus Frontal Views, IEEE Workshop on Multimedia Signal Processing, 2006.

[51] Javier R. Movellan et al., Dynamic Features for Visual Speechreading: A Systematic Comparison, NIPS, 1996.

[52] Aggelos K. Katsaggelos et al., Frame Rate and Viseme Analysis for Multimedia Applications to Assist Speechreading, J. VLSI Signal Process., 1998.

[53] N. M. van Atteveldt, Speech meets script: fMRI studies on the integration of letter and speech sounds, 2006.

[54] Alexander H. Waibel et al., Improving connected letter recognition by lipreading, IEEE International Conference on Acoustics, Speech, and Signal Processing, 1993.

[55] Barry-John Theobald et al., Comparison of human and machine-based lip-reading, AVSP, 2009.

[56] Matti Pietikäinen et al., Unsupervised texture segmentation using feature distributions, Pattern Recognition, 1997.

[57] Yochai Konig et al., "Eigenlips" for robust speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1994.

[58] Gerasimos Potamianos et al., Speaker independent audio-visual database for bimodal ASR, AVSP, 1997.

[59] Jenq-Neng Hwang et al., Lipreading from color video, IEEE Trans. Image Process., 1997.

[60] Martin J. Russell et al., Integrating audio and visual information to provide highly robust speech recognition, IEEE International Conference on Acoustics, Speech, and Signal Processing, 1996.

[61] Léon J. M. Rothkrantz et al., An audio-visual corpus for multimodal speech recognition in Dutch language, INTERSPEECH, 2002.

[62] Alexander H. Waibel et al., Toward movement-invariant automatic lip-reading and speech recognition, International Conference on Acoustics, Speech, and Signal Processing, 1995.

[63] Koji Iwano, Bimodal speech recognition using lip movement measured by optical flow analysis, 2001.

[64] C. G. Fisher, Confusions among visually perceived consonants, Journal of Speech and Hearing Research, 1968.

[65] Juergen Luettin et al., Speechreading using Probabilistic Models, Comput. Vis. Image Underst., 1997.

[66] Farzin Deravi et al., Design issues for a digital audio-visual integrated database, 1996.