Noisy audio speech enhancement using Wiener filters derived from visual speech

The aim of this paper is to use visual speech information to create Wiener filters for audio speech enhancement. Wiener filters require estimates of both clean speech statistics and noisy speech statistics. The noisy speech statistics are obtained from the noisy input audio, whereas obtaining clean speech statistics is more difficult and is a major problem in the creation of Wiener filters for speech enhancement. In this work the clean speech statistics are estimated from frames of visual speech that are extracted in synchrony with the audio. The estimation procedure begins by modelling the joint density of clean audio and visual speech features using a Gaussian mixture model (GMM). Given the GMM and an input visual speech vector, a maximum a posteriori (MAP) estimate of the audio feature is made. The effectiveness of speech enhancement using the visually-derived Wiener filter has been compared to that of a conventional audio-based Wiener filter implementation using a perceptual evaluation of speech quality (PESQ) analysis. PESQ scores in train noise at different signal-to-noise ratios (SNRs) show that the visually-derived Wiener filter significantly outperforms the audio-based Wiener filter at lower SNRs.
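The two steps described in the abstract, estimating a clean audio feature from a visual vector under a joint GMM and then forming a Wiener filter from clean and noise statistics, can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the abstract: the audio estimate is taken as the posterior-weighted conditional mean of the joint GMM (one common way to form such an estimate, and not necessarily the paper's exact estimator), the filter is the standard frequency-domain gain S/(S + N), and the function names `gmm_audio_estimate` and `wiener_gain` are hypothetical.

```python
import numpy as np


def wiener_gain(clean_psd, noise_psd, floor=1e-12):
    """Per-bin Wiener gain H = S_clean / (S_clean + S_noise).

    Assumes both inputs are non-negative power spectra on the same
    frequency grid; `floor` only guards against division by zero.
    """
    clean_psd = np.asarray(clean_psd, dtype=float)
    noise_psd = np.asarray(noise_psd, dtype=float)
    return clean_psd / np.maximum(clean_psd + noise_psd, floor)


def gmm_audio_estimate(v, weights, means, covs, n_audio):
    """Estimate an audio feature vector from a visual vector `v`.

    Assumed layout: each joint component k has mean [mu_a; mu_v] and
    covariance [[S_aa, S_av], [S_va, S_vv]], with the first `n_audio`
    dimensions holding the audio feature.
    """
    v = np.asarray(v, dtype=float)
    log_post = np.empty(len(weights))
    cond_means = []
    for k, (w, mu, cov) in enumerate(zip(weights, means, covs)):
        mu = np.asarray(mu, dtype=float)
        cov = np.asarray(cov, dtype=float)
        mu_a, mu_v = mu[:n_audio], mu[n_audio:]
        s_av = cov[:n_audio, n_audio:]
        s_vv = cov[n_audio:, n_audio:]
        diff = v - mu_v
        solved = np.linalg.solve(s_vv, diff)  # S_vv^{-1} (v - mu_v)
        _, logdet = np.linalg.slogdet(s_vv)
        # Log of (weight * Gaussian likelihood of v) under component k.
        log_post[k] = np.log(w) - 0.5 * (
            diff @ solved + logdet + v.size * np.log(2.0 * np.pi)
        )
        # Conditional mean of the audio feature given v for component k.
        cond_means.append(mu_a + s_av @ solved)
    # Normalise to component posteriors in a numerically stable way.
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ np.array(cond_means)
```

In a full system the estimated audio feature would still need to be mapped to a clean-speech power spectrum before being passed to `wiener_gain`; that mapping depends on the feature type, which the abstract does not specify, so it is left out here.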
