Local spatiotemporal descriptors for visual recognition of spoken phrases

Visual speech information plays an important role in speech recognition under noisy conditions or for listeners with hearing impairment. In this paper, we propose local spatiotemporal descriptors to represent and recognize spoken isolated phrases based solely on visual input. Eye positions determined by a robust face and eye detector are used to localize the mouth region in each face image. Spatiotemporal local binary patterns extracted from these regions are used to describe the phrase sequences. In our experiments with 817 sequences from ten phrases and 20 speakers, promising accuracies of 62% and 70% were obtained in speaker-independent and speaker-dependent recognition, respectively. On the Tulips1 audio-visual database, our method achieved an accuracy of 92.7%, clearly outperforming the other methods compared. Advantages of our approach include local processing and robustness to monotonic gray-scale changes. Moreover, no error-prone segmentation of the moving lips is needed.
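The sketch below is only a rough, hypothetical illustration of such a pipeline, not the implementation evaluated in the paper: it assumes OpenCV's stock Haar-cascade face and eye detectors as the "robust face and eye detector", an invented mouth-box geometry relative to the detected eye centres, and a heavily simplified single-plane LBP histogram per orientation in place of the full block-based spatiotemporal (LBP-TOP) descriptor.

```python
# Illustrative sketch only; cascade files, mouth-box geometry and the
# simplified LBP-TOP computation are assumptions, not the authors' method.
import cv2
import numpy as np

FACE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
EYE_CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def mouth_roi(gray, roi_scale=1.2):
    """Crop a mouth region from a gray-scale face image using eye positions."""
    faces = FACE_CASCADE.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    eyes = EYE_CASCADE.detectMultiScale(gray[y:y + h, x:x + w], 1.1, 5)
    if len(eyes) < 2:
        return None
    # Eye centres in image coordinates, left eye first.
    centres = [(x + ex + ew // 2, y + ey + eh // 2) for ex, ey, ew, eh in eyes[:2]]
    (lx, ly), (rx, ry) = sorted(centres)
    d = max(rx - lx, 1)                      # inter-eye distance
    cx, cy = (lx + rx) // 2, (ly + ry) // 2  # midpoint between the eyes
    # Assumed geometry: mouth box centred ~1.1 eye-distances below the eyes.
    mw, mh = int(roi_scale * d), int(0.6 * roi_scale * d)
    my = cy + int(1.1 * d)
    y0, x0 = max(my - mh // 2, 0), max(cx - mw // 2, 0)
    return gray[y0: my + mh // 2, x0: cx + mw // 2]

def lbp_hist(plane):
    """Normalized 8-neighbour LBP histogram (256 bins) of a 2-D array."""
    c = plane[1:-1, 1:-1]
    neighbours = [plane[0:-2, 0:-2], plane[0:-2, 1:-1], plane[0:-2, 2:],
                  plane[1:-1, 2:],   plane[2:, 2:],     plane[2:, 1:-1],
                  plane[2:, 0:-2],   plane[1:-1, 0:-2]]
    codes = np.zeros_like(c, dtype=np.uint8)
    for bit, n in enumerate(neighbours):
        codes |= (n >= c).astype(np.uint8) << bit   # one bit per neighbour
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)

def lbp_top_features(volume):
    """Concatenate LBP histograms from the XY, XT and YT planes of a T x H x W
    mouth-region volume (central plane of each orientation only, for brevity)."""
    t, h, w = volume.shape
    xy = lbp_hist(volume[t // 2])        # spatial appearance
    xt = lbp_hist(volume[:, h // 2, :])  # horizontal motion over time
    yt = lbp_hist(volume[:, :, w // 2])  # vertical motion over time
    return np.concatenate([xy, xt, yt])
```

In the full descriptor the LBP codes would be accumulated over every pixel of several block-divided mouth sub-volumes and the concatenated histograms passed to a classifier; here the output is simply a fixed-length vector that could be compared between phrase sequences with, for example, a histogram distance.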
