TOWARD ROBUST MULTIMODAL SPEECH RECOGNITION
In this paper, a robust multimodal speech recognition system is proposed to improve the performance of automatic speech recognition (ASR). A visual feature extraction technique for real-world data is developed and implemented. Multi-stream hidden Markov models (HMMs) with weighting factors are used to combine audio and visual information, applying a stream-weight optimization scheme based on an output-likelihood maximization criterion. The proposed system is evaluated on Japanese connected-digit speech recorded in real environments. Using about 10 seconds of speech data for stream-weight optimization, a 30% relative error reduction is achieved compared to the result before optimization. By additionally applying noise adaptation, a roughly 60% relative error reduction is obtained over the audio-only scheme, using only a small amount of optimization data.
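The multi-stream combination described above weights the per-stream log-likelihoods and searches for the weight that maximizes the output likelihood on a small amount of adaptation data. A minimal sketch of that mechanism follows; the function names, the toy per-frame log-likelihoods, and the simple grid search are illustrative assumptions, not the paper's actual implementation (which operates inside HMM decoding).

```python
import math

def combined_log_likelihood(log_lik_audio, log_lik_visual, w_audio):
    # Multi-stream combination: weighted sum of the audio and visual
    # log-likelihoods, with the two weights constrained to sum to 1.
    return w_audio * log_lik_audio + (1.0 - w_audio) * log_lik_visual

def optimize_stream_weight(frames, step=0.05):
    # Grid search for the audio stream weight that maximizes the total
    # output likelihood over the adaptation data (hypothetical scheme
    # standing in for the paper's likelihood-maximization criterion).
    best_w, best_total = 0.0, -math.inf
    for i in range(int(1.0 / step) + 1):
        w = i * step
        total = sum(combined_log_likelihood(a, v, w) for a, v in frames)
        if total > best_total:
            best_w, best_total = w, total
    return best_w, best_total

# Toy adaptation data: (audio log-lik, visual log-lik) per frame.
frames = [(-2.0, -5.0), (-1.5, -4.0), (-3.0, -2.5)]
w, total = optimize_stream_weight(frames)
```

In this toy data the audio stream is more reliable on every frame, so the search pushes the audio weight toward 1; in a noisy real environment the visual stream would pull the optimum toward an intermediate weight.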