Noisy audio feature enhancement using audio-visual speech data

We investigate improving automatic speech recognition (ASR) in noisy conditions by enhancing noisy audio features with visual speech information captured from the speaker's face. Enhancement applies a linear filter to the concatenated vector of noisy audio and visual features; the filter is obtained in a training stage by mean square error estimation of the clean audio features. The enhanced audio features are evaluated on two ASR tasks: a connected digits task and speaker-independent, large-vocabulary continuous speech recognition. In both cases, at sufficiently low signal-to-noise ratios (SNRs), ASR trained on the enhanced audio features significantly outperforms ASR trained on the noisy audio, achieving, for example, a 46% relative reduction in word error rate on the digits task at −3.5 dB SNR. However, the method fails to capture the full benefit of the visual modality to ASR, as shown by a comparison with the discriminant audio-visual feature fusion introduced in previous work.
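To make the enhancement step concrete, the following is a minimal sketch of a linear filter fitted by least squares (the sample form of mean square error estimation) on concatenated noisy audio and visual features. It is not the authors' implementation: the function names, the bias term, the feature dimensions, and the synthetic data are all assumptions chosen for illustration, and the abstract does not specify them.

```python
import numpy as np


def fit_linear_enhancer(noisy_audio, visual, clean_audio):
    """Fit W, b minimizing the mean square error of
    clean_audio ~ [noisy_audio, visual] @ W + b over the training frames.
    The bias term b is an assumption, not stated in the abstract."""
    joint = np.hstack([noisy_audio, visual])          # (frames, d_a + d_v)
    ones = np.ones((joint.shape[0], 1))
    joint_aug = np.hstack([joint, ones])              # append bias column
    coef, *_ = np.linalg.lstsq(joint_aug, clean_audio, rcond=None)
    return coef[:-1], coef[-1]                        # W, b


def enhance(noisy_audio, visual, W, b):
    """Apply the fitted linear filter to concatenated features."""
    return np.hstack([noisy_audio, visual]) @ W + b


# Synthetic stand-in data (hypothetical dimensions, not from the paper).
rng = np.random.default_rng(0)
n_frames, d_audio, d_visual = 2000, 6, 4
clean = rng.standard_normal((n_frames, d_audio))
# Visual features are made partially predictive of the clean audio.
visual = (clean @ rng.standard_normal((d_audio, d_visual))
          + 0.5 * rng.standard_normal((n_frames, d_visual)))
noisy = clean + rng.standard_normal((n_frames, d_audio))  # additive noise

W, b = fit_linear_enhancer(noisy, visual, clean)
enhanced = enhance(noisy, visual, W, b)

mse_noisy = np.mean((noisy - clean) ** 2)
mse_enhanced = np.mean((enhanced - clean) ** 2)
print(f"MSE noisy: {mse_noisy:.3f}, MSE enhanced: {mse_enhanced:.3f}")
```

On the training frames, the fitted filter cannot do worse in mean square error than passing the noisy audio through unchanged, since that identity mapping is itself one of the linear filters being searched over. Whether this gain carries over to recognition accuracy is a separate question, which is what the ASR experiments address.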
