Problems associated with current area-based visual speech feature extraction techniques

Techniques such as principal component analysis (PCA), linear discriminant analysis (LDA) and the discrete cosine transform (DCT) have all been used to good effect in face recognition. Because these techniques can compactly represent a set of features, researchers have sought to use them to extract the visual speech content for audio-visual speech recognition (AVSR). In this paper, we expose the problems of employing such techniques in AVSR by running a series of visual-only speech recognition experiments. The results of these experiments illustrate that current area-based feature extraction techniques are heavily dependent on the visual front-end, and are ineffective at decoupling adequate speech content from a speaker's mouth. As a potential solution, we introduce the concept of a free-parts representation, which may circumvent many of the problems experienced by current area-based techniques.
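To make the area-based (holistic) approach under discussion concrete, the following is a minimal sketch of PCA-based "eigenlip" feature extraction, in the spirit of the eigenlips work cited here: each grayscale mouth region-of-interest is flattened to a vector, and the whole stack of frames is projected onto its top-k principal components. The function name, array shapes, and the choice of k are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def eigenlip_features(rois: np.ndarray, k: int = 10) -> np.ndarray:
    """Project flattened mouth-ROI frames onto their top-k principal
    components ("eigenlips").

    rois : (n_frames, h*w) array, one flattened grayscale ROI per row.
    Returns an (n_frames, k) array of compact visual speech features.
    """
    # Centre the data so PCA captures variation about the mean mouth image.
    mean_roi = rois.mean(axis=0)
    centered = rois - mean_roi
    # SVD of the centered data: rows of vt are the principal axes
    # (the "eigenlips"), ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    # Project each frame onto the leading k axes.
    return centered @ vt[:k].T
```

Note that this whole-area projection is exactly what ties the features to the visual front-end: any ROI localisation error shifts every pixel in the flattened vector, which is one of the sensitivities the experiments in this paper probe.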
