Speaker-independent audio-visual speech recognition

We present a general framework for integrating multimodal sensory signals for spatiotemporal pattern recognition. Statistical methods are used to model time-varying events in a collaborative manner such that inter-modal co-occurrences are taken into account. We discuss data fusion strategies, the modeling of inter-modal correlations, and the estimation of statistical parameters for multimodal models. A bimodal speech recognition system is implemented, and a speaker-independent experiment is carried out to test the audio-visual speech recognizer under different types of noise drawn from a noise database. Consistent improvements in word recognition accuracy (WRA) are achieved across a range of signal-to-noise ratios, evaluated with a cross-validation scheme.
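
To make the fusion idea concrete, below is a minimal sketch of one common audio-visual fusion strategy: state-level stream-exponent weighting of per-modality observation log-likelihoods, as used in multi-stream HMMs. The function name, the fixed audio weight, and the three-state example are illustrative assumptions, not the paper's actual implementation; in practice the stream weights would be tuned (e.g., to the signal-to-noise ratio, shifting weight toward the visual stream as the audio degrades).

    import numpy as np

    def fused_log_likelihood(log_b_audio: np.ndarray,
                             log_b_video: np.ndarray,
                             lambda_audio: float = 0.7) -> np.ndarray:
        """Combine per-state audio and visual observation log-likelihoods
        with stream exponents:
            log b(o) = la * log b_a(o_a) + (1 - la) * log b_v(o_v)
        lambda_audio is the audio stream exponent; the visual stream
        receives the complementary weight.
        """
        lambda_video = 1.0 - lambda_audio
        return lambda_audio * log_b_audio + lambda_video * log_b_video

    # Example: per-state log-likelihoods for one frame from each modality
    # (three hypothetical HMM states).
    log_b_a = np.log(np.array([0.6, 0.3, 0.1]))   # audio stream
    log_b_v = np.log(np.array([0.4, 0.4, 0.2]))   # visual stream
    print(fused_log_likelihood(log_b_a, log_b_v, lambda_audio=0.5))

The fused per-state scores would then feed a standard Viterbi decoder in place of the single-stream observation likelihoods; other strategies mentioned above (e.g., early fusion of concatenated feature vectors) trade off differently against modeling inter-modal correlation.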