Lip Feature Extraction Using Red Exclusion

Automatic speech recognition (ASR) performs well under restricted conditions, but its performance degrades in noisy environments. Audio-visual speech recognition (AVSR) combats this by incorporating a visual signal into the recognition process. This paper briefly reviews the contribution of psycholinguistics to this endeavour and recent advances in machine AVSR. An important first step in AVSR is feature extraction from the mouth region. This paper examines several well-known pixel-based techniques (grayscale, horizontal edge, and the red and hue colour spaces) and compares how well they work on our naturalistic database. Finally, a novel method of feature extraction, red exclusion, is described that outperforms the others on this data set.
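
To make the four baseline pixel-based representations concrete, the following is a minimal sketch of how each might be computed for a small RGB mouth region. It is illustrative only and is not taken from the paper: the function names, the BT.601 luma weights, the simple row-difference edge operator, and the image layout (rows of (r, g, b) tuples with values 0 to 255) are all assumptions. Red exclusion itself is not sketched, because the abstract does not define it.

```python
import colorsys


def grayscale(img):
    """Luma per pixel, using ITU-R BT.601 weights (an assumed choice)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in img]


def red_channel(img):
    """Red component of each pixel."""
    return [[r for r, _, _ in row] for row in img]


def hue(img):
    """Hue of each pixel in [0, 1), via the standard-library HSV conversion."""
    return [
        [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[0] for r, g, b in row]
        for row in img
    ]


def horizontal_edges(gray):
    """Absolute intensity difference between vertically adjacent rows.

    A large value marks a horizontal edge, such as the dark line between
    the lips. The output has one row fewer than the input.
    """
    width = len(gray[0])
    return [
        [abs(gray[y + 1][x] - gray[y][x]) for x in range(width)]
        for y in range(len(gray) - 1)
    ]


# Hypothetical 2x2 region: a brighter reddish row above a darker reddish row.
sample = [
    [(200, 50, 50), (200, 50, 50)],
    [(100, 20, 20), (100, 20, 20)],
]

gray = grayscale(sample)
edges = horizontal_edges(gray)
reds = red_channel(sample)
hues = hue(sample)
```

Each function returns a feature map of per-pixel values; a real system would typically threshold or further process these maps to locate the lips.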
