Visual information assisted Mandarin large vocabulary continuous speech recognition

We present a general framework for Mandarin audio-visual large vocabulary continuous speech recognition (LVCSR) that integrates visual information to improve recognition performance and robustness. Three key problems of audio-visual LVCSR are addressed: lip tracking, visual feature extraction, and audio-visual fusion. First, linear-transform-based lip tracking and low-level visual feature extraction methods are presented and compared with lip-contour-based feature extraction. Next, an audio-visual fusion strategy based on the multistream hidden Markov model (MSHMM) is investigated, and a novel approach is presented for training global or state-dependent stream weights under the minimum classification error (MCE) criterion. Experimental results show that introducing visual information reduces the word error rate (WER) of the LVCSR system by 36.09% relative under clean audio conditions, and that system robustness in noisy environments is also significantly enhanced.
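For concreteness, the following is a standard formulation of the MSHMM fusion and MCE loss named above, not necessarily the paper's exact equations. In an MSHMM, the emission likelihood of state $j$ combines the audio and visual stream likelihoods as a weighted geometric product, and MCE training replaces the hard classification error with a sigmoid-smoothed loss that can be minimized by gradient descent over the stream weights $\lambda_{js}$:

```latex
% Multistream HMM emission: per-state likelihood is a weighted
% product of the audio (A) and visual (V) stream likelihoods.
b_j(\mathbf{o}_t) = \prod_{s \in \{A,V\}} b_{js}(\mathbf{o}_{st})^{\lambda_{js}},
\qquad \lambda_{jA} + \lambda_{jV} = 1, \quad \lambda_{js} \ge 0 .

% MCE training: a misclassification measure d_j (the discriminant of
% the correct class against its competitors) is smoothed by a sigmoid,
% and the stream weights are updated by gradient descent on \ell.
\ell(d_j) = \frac{1}{1 + e^{-\gamma d_j}}, \qquad \gamma > 0 .
```

In this formulation, the stream weights are global when a single $\lambda_s$ is tied across all states and state-dependent when each state $j$ keeps its own $\lambda_{js}$; state-dependent weights allow the relative reliability of the audio and visual streams to vary with the acoustic unit being modeled.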