Speaker-independent audio-visual speech recognition

We present a general framework for integrating multimodal sensory signals for spatiotemporal pattern recognition. Statistical methods are used to model time-varying events in a collaborative manner such that inter-modal co-occurrences are taken into account. We discuss data fusion strategies, the modeling of inter-modal correlations, and the estimation of statistical parameters for multimodal models. A bimodal speech recognition system is implemented, and a speaker-independent experiment is carried out to test the audio-visual speech recognizer under different types of noise drawn from a noise database. Consistent improvements in word recognition accuracy (WRA) are achieved across a range of signal-to-noise ratios, evaluated with a cross-validation scheme.
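
To make the fusion idea concrete, below is a minimal sketch of one common audio-visual fusion strategy: state-level stream-exponent weighting of per-modality observation log-likelihoods, as used in multi-stream HMMs. The function name, the fixed audio weight, and the three-state example are illustrative assumptions, not the paper's actual implementation; in practice the stream weights would be tuned (e.g., to the signal-to-noise ratio, shifting weight toward the visual stream as the audio degrades).

    import numpy as np

    def fused_log_likelihood(log_b_audio: np.ndarray,
                             log_b_video: np.ndarray,
                             lambda_audio: float = 0.7) -> np.ndarray:
        """Combine per-state audio and visual observation log-likelihoods
        with stream exponents:
            log b(o) = la * log b_a(o_a) + (1 - la) * log b_v(o_v)
        lambda_audio is the audio stream exponent; the visual stream
        receives the complementary weight.
        """
        lambda_video = 1.0 - lambda_audio
        return lambda_audio * log_b_audio + lambda_video * log_b_video

    # Example: per-state log-likelihoods for one frame from each modality
    # (three hypothetical HMM states).
    log_b_a = np.log(np.array([0.6, 0.3, 0.1]))   # audio stream
    log_b_v = np.log(np.array([0.4, 0.4, 0.2]))   # visual stream
    print(fused_log_likelihood(log_b_a, log_b_v, lambda_audio=0.5))

The fused per-state scores would then feed a standard Viterbi decoder in place of the single-stream observation likelihoods; other strategies mentioned above (e.g., early fusion of concatenated feature vectors) trade off differently against modeling inter-modal correlation.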