A hybrid visual feature extraction method for audio-visual speech recognition

In this paper, a hybrid visual feature extraction method that combines an extended locally linear embedding (LLE) with visemic linear discriminant analysis (LDA) is presented for audio-visual speech recognition (AVSR). First, the extended LLE is introduced to reduce the dimensionality of the mouth images: it constrains the neighborhood search for each mouth image to the corresponding speaker's own dataset rather than the whole dataset, and then maps the high-dimensional mouth image matrices into a low-dimensional Euclidean space. Second, the feature vectors are projected onto the visemic linear discriminant space to obtain an optimal classification subspace. Finally, in the audio-visual fusion stage, minimum classification error (MCE) training based on the segmental generalized probabilistic descent (GPD) algorithm is applied to optimize the audio and visual stream weights. Experimental results on the CUAVE database show that the proposed method significantly outperforms the classical PCA- and LDA-based methods in visual-only speech recognition. Further experiments demonstrate the robustness of the MCE-based discriminative training method in noisy environments.
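The key change in the extended LLE is where neighbors are searched: for each mouth image, only images from the same speaker are candidate neighbors. A minimal sketch of that idea, using scikit-learn's standard `LocallyLinearEmbedding` fitted per speaker (the function name, parameter values, and per-speaker fitting strategy here are illustrative assumptions, not taken from the paper):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

def extended_lle(mouth_images, speaker_ids, n_neighbors=8, n_components=16):
    """Per-speaker LLE sketch: neighbors for each image are found only
    within that speaker's own image set, mimicking the paper's
    neighborhood restriction. (Hyperparameters are illustrative.)"""
    # Flatten each mouth image matrix into a feature vector.
    X = np.asarray(mouth_images).reshape(len(mouth_images), -1)
    speaker_ids = np.asarray(speaker_ids)
    embedded = np.zeros((len(X), n_components))
    for spk in np.unique(speaker_ids):
        idx = np.where(speaker_ids == spk)[0]
        # Fit LLE on this speaker's images only, so the local
        # reconstruction weights never mix speakers.
        lle = LocallyLinearEmbedding(n_neighbors=n_neighbors,
                                     n_components=n_components)
        embedded[idx] = lle.fit_transform(X[idx])
    return embedded
```

Restricting the neighborhood this way keeps the locally linear reconstruction within one speaker's appearance manifold, which is the motivation the abstract gives for the extension.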
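In the fusion stage, the standard multi-stream formulation combines the audio and visual HMM log-likelihoods with exponent weights; MCE/GPD training then tunes those weights discriminatively. A minimal sketch of the weighted fusion itself, with the weights simply supplied (in the paper they are learned; the function name and the constraint that the two weights sum to one are common conventions, assumed here):

```python
def fuse_stream_loglikes(ll_audio, ll_visual, gamma_audio):
    """Weighted log-likelihood fusion of audio and visual streams.

    gamma_audio in [0, 1] is the audio stream exponent; the visual
    weight is taken as 1 - gamma_audio (a common sum-to-one
    convention, assumed here). In the paper these exponents are
    optimized by MCE training with segmental GPD rather than set
    by hand.
    """
    if not 0.0 <= gamma_audio <= 1.0:
        raise ValueError("gamma_audio must lie in [0, 1]")
    return gamma_audio * ll_audio + (1.0 - gamma_audio) * ll_visual
```

As acoustic noise increases, discriminative training would push `gamma_audio` down, relying more on the visual stream, which is consistent with the robustness result the abstract reports.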
