Mutual information eigenlips for audio-visual speech recognition

This paper proposes an information-theoretic approach for selecting the most informative subset of eigen-features for audio-visual speech recognition. State-of-the-art visual feature extraction methods for speechreading rely on pixel-based methods, geometry-based methods, or a combination of the two. However, there is no common rule defining how these features should be selected with respect to the chosen set of audio cues, or how well they represent the classes of the uttered speech. Our main objective is to exploit the complementarity of the audio and visual sources and to select meaningful visual descriptors by means of mutual information. We focus on the principal-component projections of mouth-region images and apply the proposed method so that only those cues having the highest mutual information with the word classes are retained. The algorithm is evaluated through speech recognition experiments on a chosen audio-visual dataset; the obtained recognition rates are compared to those acquired with conventional principal component analysis and show promising results.
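The selection criterion described above, ranking candidate eigen-features by their mutual information with the word classes, can be sketched in a few lines. The snippet below is an illustrative toy example, not the authors' implementation: it estimates I(X; Y) from a histogram of a discretized feature against class labels, then picks the highest-scoring column of a synthetic feature matrix (the bin count, class count, and synthetic data are all assumptions for the demo).

```python
import numpy as np

def mutual_information(feature, labels, n_bins=8):
    """Estimate I(X; Y) in bits between a continuous feature X
    (discretized into n_bins equal-width bins) and integer class labels Y."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    x = np.digitize(feature, edges[1:-1])          # bin indices 0..n_bins-1
    joint = np.zeros((n_bins, labels.max() + 1))
    for xi, yi in zip(x, labels):
        joint[xi, yi] += 1
    joint /= joint.sum()                           # empirical joint p(x, y)
    px = joint.sum(axis=1, keepdims=True)          # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)          # marginal p(y)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Toy example: rank two feature columns by MI with hypothetical word classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=500)              # 4 word classes
informative = labels + 0.3 * rng.normal(size=500)  # strongly class-dependent
noise = rng.normal(size=500)                       # class-independent
features = np.column_stack([noise, informative])
scores = [mutual_information(features[:, j], labels) for j in range(2)]
selected = int(np.argmax(scores))                  # index of the retained cue
```

In the paper's setting, the feature columns would be the principal-component projections of the mouth-region images, and the top-k columns by this score would be retained as the visual feature vector.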
