Integrating audio and visual information to provide highly robust speech recognition

Many human-machine interactions require accurate automatic speech recognition in the presence of high levels of interfering noise. This paper shows that recognition accuracy can be improved by including data derived from images of the speaker's lips. We describe how the audio and visual data are combined into composite feature vectors, and a hidden Markov model structure that allows for asynchrony between the audio and visual components. These ideas are applied to a speaker-dependent, small-vocabulary recognition task subject to interfering noise. The recognition results obtained using composite vectors and cross-product models are compared with those based on an audio-only feature vector. The approach is shown to improve performance over a very wide range of noise levels.
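The two ideas named in the abstract, composite feature vectors and a cross-product model that tolerates audio-visual asynchrony, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the function names, the choice to repeat each visual frame to match the audio frame rate, the left-to-right chain topology, and the `max_lag` bound on how far the two streams may drift apart are all hypothetical.

```python
from itertools import product


def composite_vectors(audio_frames, visual_frames):
    """Concatenate each audio frame with the time-aligned visual frame.

    Assumption: both lists are non-empty and the visual stream has the
    lower frame rate, so each visual frame is reused for several audio frames.
    """
    n_audio = len(audio_frames)
    n_visual = len(visual_frames)
    composite = []
    for t, audio in enumerate(audio_frames):
        # Map audio frame index t onto the nearest earlier visual frame.
        v_index = min(t * n_visual // n_audio, n_visual - 1)
        composite.append(list(audio) + list(visual_frames[v_index]))
    return composite


def cross_product_states(n_audio_states, n_visual_states, max_lag=1):
    """Enumerate joint states (i, j) of an audio chain and a visual chain.

    Assumption: asynchrony is limited by keeping only the pairs whose
    state indices differ by at most max_lag.
    """
    return [
        (i, j)
        for i, j in product(range(n_audio_states), range(n_visual_states))
        if abs(i - j) <= max_lag
    ]


def cross_product_transitions(states):
    """List the allowed successors of each joint state.

    Each stream either stays in its state or advances by one, independently
    of the other, which is what lets the two streams move out of step.
    """
    allowed = set(states)
    transitions = {}
    for i, j in states:
        successors = []
        for di, dj in product((0, 1), repeat=2):
            nxt = (i + di, j + dj)
            if nxt in allowed:
                successors.append(nxt)
        transitions[(i, j)] = successors
    return transitions


if __name__ == "__main__":
    audio = [[1, 2], [3, 4], [5, 6], [7, 8]]  # four audio frames
    visual = [[9], [10]]                      # two visual frames
    print(composite_vectors(audio, visual))

    states = cross_product_states(3, 3, max_lag=1)
    print(cross_product_transitions(states)[(0, 1)])
```

In this sketch, a joint state such as `(0, 1)` means the visual stream has already moved one state ahead of the audio stream. Setting `max_lag=0` removes every off-diagonal state and reduces the model to a fully synchronous one over the composite vectors.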
