Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis

Abstract: Bimodal speech recognition extends acoustic speech recognition by using both acoustic and visual speech information to improve recognition accuracy in noisy environments. Although various bimodal speech systems have been developed, a rigorous and detailed comparison of the possible geometric visual features extracted from speakers' faces has not yet been reported. In this paper, therefore, the geometric visual features are rigorously compared and analyzed for their importance in bimodal speech recognition. The relevance of each candidate visual feature is used to determine the best combination of geometric visual features for both visual-only and bimodal speech recognition. Of the geometric visual features analyzed, the lip vertical aperture is the most relevant, and the set formed by the vertical and horizontal lip apertures together with the first-order derivative of the lip corner angle gives the best results among the reduced feature sets examined. The paper also analyzes the effect of the modelling parameters of hidden Markov models (HMMs) on the recognition accuracy of each individual geometric lip feature. Finally, the accuracies of acoustic-only, visual-only, and bimodal speech recognition are experimentally determined and compared using the optimized HMMs and geometric visual features. Compared with acoustic-only and visual-only speech recognition, the bimodal scheme achieves a much higher recognition accuracy with the geometric visual features, especially in the presence of noise. The results show that a set of as few as three labial geometric features is sufficient to raise the recognition rate by 20 percentage points, from 62% with acoustic-only information to 82% with audio-visual information at a signal-to-noise ratio (SNR) of 0 dB.
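To make the three retained features concrete, the sketch below computes the vertical aperture, horizontal aperture, and a finite-difference estimate of the first-order derivative of the lip corner angle from tracked lip landmarks. The landmark interface (keys such as 'left_corner' and 'top_mid') and the particular angle definition are assumptions for illustration; the paper does not prescribe this API.

```python
import numpy as np

def lip_geometric_features(landmarks, prev_angle=None, dt=1.0):
    """Compute [vertical aperture, horizontal aperture, d(corner angle)/dt].

    `landmarks` is assumed to be a dict of 2-D points from a lip tracker
    (hypothetical keys: 'left_corner', 'right_corner', 'top_mid',
    'bottom_mid'). `dt` is the video frame period in seconds.
    """
    lc = np.asarray(landmarks['left_corner'], dtype=float)
    rc = np.asarray(landmarks['right_corner'], dtype=float)
    tm = np.asarray(landmarks['top_mid'], dtype=float)
    bm = np.asarray(landmarks['bottom_mid'], dtype=float)

    vertical_aperture = np.linalg.norm(tm - bm)    # lip opening height
    horizontal_aperture = np.linalg.norm(rc - lc)  # lip opening width

    # One plausible definition of the lip corner angle: the angle at the
    # left corner between the vectors to the upper and lower midpoints.
    v1, v2 = tm - lc, bm - lc
    cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    angle = np.arccos(np.clip(cos_a, -1.0, 1.0))

    # First-order derivative approximated by a backward difference
    # between consecutive video frames.
    d_angle = 0.0 if prev_angle is None else (angle - prev_angle) / dt

    return np.array([vertical_aperture, horizontal_aperture, d_angle]), angle
```

The returned `angle` is fed back as `prev_angle` on the next frame so the derivative can be tracked across the utterance.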

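The abstract's recognition experiments rest on word-level HMMs trained on the chosen feature vectors. As a minimal sketch of one way such a system can be assembled, the code below trains one Gaussian HMM per word on concatenated acoustic and geometric visual features (simple feature-level fusion; the paper's exact fusion scheme and HMM topology may differ) and classifies by maximum log-likelihood, using the third-party hmmlearn library.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # third-party: pip install hmmlearn

def train_word_models(train_data, n_states=5):
    """Train one Gaussian HMM per word.

    `train_data` maps each word label to a list of utterances, where an
    utterance is a (frames, features) array whose rows concatenate the
    acoustic features with the three geometric lip features. The number
    of states and the diagonal covariance are illustrative choices, not
    the paper's tuned modelling parameters.
    """
    models = {}
    for word, utterances in train_data.items():
        X = np.vstack(utterances)              # stack all training frames
        lengths = [len(u) for u in utterances]  # per-utterance frame counts
        m = GaussianHMM(n_components=n_states,
                        covariance_type='diag', n_iter=20)
        m.fit(X, lengths)
        models[word] = m
    return models

def recognize(models, utterance):
    """Return the word whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda w: models[w].score(utterance))
```

Varying `n_states` (and the Gaussian mixture structure) per feature is the kind of HMM modelling-parameter study the abstract refers to.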