TOWARD ROBUST MULTIMODAL SPEECH RECOGNITION

In this paper, a robust multimodal speech recognition system is proposed to improve the performance of automatic speech recognition (ASR). A visual feature extraction technique for real-world data is developed and implemented. Multi-stream hidden Markov models (HMMs) with stream weights are used to combine audio and visual information, applying a stream-weight optimization scheme based on an output likelihood maximization criterion. The proposed system is evaluated on Japanese connected digit speech recorded in real environments. Using about 10 seconds of speech data for stream-weight optimization, a 30% relative error reduction is achieved compared to the result before optimization. By additionally applying noise adaptation, a roughly 60% relative error reduction over the audio-only scheme is obtained using only a small amount of optimization data.
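In a multi-stream HMM, the audio and visual streams are typically combined as a weighted sum of per-stream log-likelihoods, log b(o) = λ_a·log b_a(o_a) + λ_v·log b_v(o_v) with λ_a + λ_v = 1. The following is a minimal sketch of that combination and of picking the stream weight that maximizes the combined output likelihood over held-out frames; the function names and example log-likelihood values are hypothetical, not from the paper:

```python
def combined_log_likelihood(log_b_audio, log_b_visual, lambda_audio):
    """Weighted multi-stream output log-likelihood for one state/frame:
    lambda_a * log b_a(o_a) + (1 - lambda_a) * log b_v(o_v)."""
    return lambda_audio * log_b_audio + (1.0 - lambda_audio) * log_b_visual

def optimize_stream_weight(frame_pairs, steps=100):
    """Grid search for the audio stream weight that maximizes the total
    combined log-likelihood over (audio, visual) log-likelihood pairs.
    A stand-in for the paper's likelihood-maximization criterion."""
    best_lambda, best_score = 0.0, float("-inf")
    for i in range(steps + 1):
        lam = i / steps
        score = sum(combined_log_likelihood(a, v, lam) for a, v in frame_pairs)
        if score > best_score:
            best_lambda, best_score = lam, score
    return best_lambda

# Hypothetical per-frame (audio, visual) log-likelihoods; audio is more
# informative here, so the search favors a larger audio weight.
frames = [(-4.2, -6.8), (-3.9, -7.1), (-5.0, -6.5)]
print(optimize_stream_weight(frames))
```

In practice the weights are tied across states and re-estimated per environment, so a noisier acoustic channel shifts the optimum toward the visual stream.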