Frame-dependent multi-stream reliability indicators for audio-visual speech recognition

We investigate the use of local, frame-dependent reliability indicators of the audio and visual modalities as a means of estimating the stream exponents of multi-stream hidden Markov models for audio-visual automatic speech recognition. We consider two such indicators for each modality, defined as functions of the speech-class conditional observation probabilities of appropriate audio- or visual-only classifiers. We subsequently map the four reliability indicators into the stream exponents of a state-synchronous, two-stream hidden Markov model, as a sigmoid function of their linear combination. We propose two algorithms for estimating the sigmoid weights, based on the maximum conditional likelihood and minimum classification error criteria. We demonstrate the superiority of the proposed approach on a connected-digit audio-visual speech recognition task under varying audio channel noise conditions. Indeed, using the estimated, frame-dependent stream exponents results in a significantly smaller word error rate than using global stream exponents. It also outperforms utterance-level exponents, even though the latter utilize a priori knowledge of the utterance noise level.
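As a minimal illustrative sketch of the mapping described above (not the authors' implementation), the frame-level computation can be written as follows. Two assumptions are made for concreteness: the audio and visual exponents are constrained to sum to one per frame, and the sigmoid weights shown are hypothetical placeholders rather than values trained with the MCL or MCE procedures.

```python
import numpy as np

def stream_exponents(reliability, weights, bias=0.0):
    """Map frame-dependent reliability indicators to stream exponents.

    reliability: (T, 4) array -- two indicators per modality per frame
    weights:     (4,) array   -- sigmoid weights (trained via MCL or MCE)
    Returns (lambda_audio, lambda_visual), each of shape (T,), assuming
    the two exponents are constrained to sum to one at every frame.
    """
    z = reliability @ weights + bias        # linear combination per frame
    lam_audio = 1.0 / (1.0 + np.exp(-z))    # sigmoid squashes into (0, 1)
    lam_visual = 1.0 - lam_audio            # sum-to-one constraint (assumed)
    return lam_audio, lam_visual

def combined_log_likelihood(logp_audio, logp_visual, lam_a, lam_v):
    """State-synchronous two-stream HMM emission score per frame:
    log b(o_t) = lam_a * log b_A(o_At) + lam_v * log b_V(o_Vt)."""
    return lam_a * logp_audio + lam_v * logp_visual

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r = rng.uniform(0.0, 1.0, size=(3, 4))   # 3 frames x 4 indicators (toy data)
    w = np.array([1.2, 0.8, -1.0, -0.6])     # hypothetical weights
    lam_a, lam_v = stream_exponents(r, w)
    print(lam_a, lam_v)
```

The sum-to-one convention means a single sigmoid output per frame suffices to weight both streams: as the audio reliability indicators degrade (e.g., under channel noise), the exponent mass shifts toward the visual stream.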
