Multi-modal speech recognition using correlation between modalities

In recent years, Audio-Visual Speech Recognition (AVSR) systems, which exploit not only audio but also visual information from the lips, have been studied as a way to achieve speech recognition that is robust against noise. This paper proposes a method for determining the weight, called the stream exponent, that represents the relative reliability of the audio and visual features. The method estimates the optimal stream exponent from the correlation between the audio and visual modalities. Furthermore, we modify the stream exponent using the periodicity of speech, such as pitch, in order to handle abrupt noises. We built an audio-visual database consisting of a specific speaker's lip image sequences and the corresponding audio; the utterances are Japanese counting numbers and sound-alike words. Using this database, we constructed an AVSR system and carried out an evaluation experiment. The results confirm the effectiveness of the proposed method under a variety of noisy environments.
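The core idea above, weighting the audio and visual streams by a stream exponent estimated from cross-modal correlation, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of a single feature dimension per stream, and the linear mapping from correlation to exponent are all assumptions made for clarity.

```python
import numpy as np

def stream_exponent_from_correlation(audio_feats, visual_feats,
                                     lam_min=0.0, lam_max=1.0):
    """Hypothetical sketch: estimate the audio stream exponent from the
    correlation between time-synchronized audio and visual feature
    sequences. High audio-visual correlation suggests the audio is
    clean, so more weight goes to the audio stream; low correlation
    shifts weight toward the lip features."""
    # For illustration, correlate only the first dimension of each stream.
    a = audio_feats[:, 0]
    v = visual_feats[:, 0]
    r = np.corrcoef(a, v)[0, 1]
    # Map |r| in [0, 1] linearly onto the allowed exponent range.
    return lam_min + abs(r) * (lam_max - lam_min)

def combined_log_likelihood(logp_audio, logp_visual, lam):
    """Multi-stream score: exponent-weighted sum of the per-stream
    log-likelihoods, as used in multi-stream HMM decoding."""
    return lam * logp_audio + (1.0 - lam) * logp_visual
```

In a full AVSR decoder the combined score would be computed per HMM state and frame; the paper's pitch-based modification would then adjust `lam` downward on frames where an abrupt noise disrupts the periodicity of the audio.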
