An approach to vowel recognition using 2D-DWT-based visual information of the lip region

This paper proposes a vowel recognition scheme that uses visual information and is based on the two-dimensional discrete wavelet transform (2D-DWT). First, a video frame corresponding to a steady vowel zone is selected using the speech characteristics of the audio frames. Next, a pixel-based method is proposed to identify the lip region of the selected video frame; it exploits the intensity variation across the different color planes. The 2D-DWT is then applied to a combined image plane formed as the weighted sum of the red-plane and green-plane pixels of the lip image. The lower-order wavelet coefficients obtained after second-level decomposition, together with the differences among those coefficients, serve as the proposed features. Classification accuracy is evaluated with leave-one-out cross-validation and a distance-based classifier. The proposed method is tested on a publicly available standard audiovisual database and achieves a high level of recognition accuracy using only the extracted visual features.
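The feature-extraction and evaluation steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions, because the abstract does not specify them: the Haar wavelet, equal red/green weights of 0.5, reading "lower order coefficients" as the approximation (LL) band, using adjacent differences as the "differences among those coefficients", and a Euclidean nearest-neighbor rule as the distance-based classifier. All function names are hypothetical.

```python
import math


def combine_planes(red, green, w_red=0.5, w_green=0.5):
    """Weighted sum of the red and green planes (the weights are assumed values)."""
    return [[w_red * r + w_green * g for r, g in zip(red_row, green_row)]
            for red_row, green_row in zip(red, green)]


def haar_approx(img):
    """One level of the orthonormal 2D Haar DWT, keeping only the LL band.

    Assumes the image height and width are both even.
    """
    out = []
    for i in range(0, len(img), 2):
        row = []
        for j in range(0, len(img[0]), 2):
            block_sum = img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]
            row.append(block_sum / 2.0)
        out.append(row)
    return out


def extract_features(red, green, levels=2):
    """Second-level approximation coefficients plus their adjacent differences."""
    plane = combine_planes(red, green)
    for _ in range(levels):
        plane = haar_approx(plane)
    coeffs = [c for row in plane for c in row]
    # Assumption: "differences among coefficients" means adjacent differences.
    diffs = [b - a for a, b in zip(coeffs, coeffs[1:])]
    return coeffs + diffs


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def leave_one_out_accuracy(features, labels):
    """Hold out each sample in turn and label it by its nearest remaining neighbor."""
    correct = 0
    for i, (feat, label) in enumerate(zip(features, labels)):
        others = [j for j in range(len(features)) if j != i]
        nearest = min(others, key=lambda j: euclidean(feat, features[j]))
        correct += labels[nearest] == label
    return correct / len(features)
```

In use, `extract_features` would be called once per cropped lip image (passed as two equally sized lists of pixel rows), and the resulting vectors and vowel labels would be handed to `leave_one_out_accuracy`. A library such as PyWavelets could replace the hand-written Haar step if a different wavelet family is preferred.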
