Improving hands-free speech recognition in a car through audio-visual voice activity detection

In this work, we show how speech recognition performance in a noisy car environment can be improved by combining audio-visual voice activity detection (VAD) with microphone array processing. This is accomplished by enhancing the multi-channel audio signal prior to the speaker localization step through per-channel power spectral subtraction, where the noise estimates are obtained from the non-speech segments identified by the VAD. This noise reduction step improves the accuracy of the estimated speaker positions and thereby the quality of the beamformed signal produced by the subsequent array processing step. Audio-visual VAD has the advantage of being more robust than audio-only VAD in acoustically demanding environments. This claim is substantiated through speech recognition experiments on the AVICAR corpus, where the proposed localization framework achieved a word error rate (WER) of 7.1% in combination with delay-and-sum beamforming, compared to 8.9% for speaker localization with audio-only VAD, 11.6% without VAD, and 15.6% for a single distant channel.
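The core enhancement step described above can be illustrated with a minimal sketch of power spectral subtraction applied to one channel's STFT, where the noise power spectrum is estimated from frames the VAD has labeled as non-speech. The function name, arguments, and the spectral floor value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def spectral_subtraction(stft_frames, vad_is_speech, floor=0.01):
    """Per-channel power spectral subtraction (illustrative sketch).

    stft_frames:   complex array, shape (num_frames, num_bins), one channel's STFT
    vad_is_speech: boolean array, shape (num_frames,); True marks speech frames
    floor:         spectral floor factor to limit musical noise (assumed value)
    """
    power = np.abs(stft_frames) ** 2
    # Estimate the noise power spectrum from the VAD-flagged non-speech frames.
    noise_power = power[~vad_is_speech].mean(axis=0)
    # Subtract the noise estimate per bin; clamp at a floor to avoid negative power.
    clean_power = np.maximum(power - noise_power, floor * noise_power)
    # Reconstruct the enhanced spectrum, reusing the noisy phase.
    return np.sqrt(clean_power) * np.exp(1j * np.angle(stft_frames))
```

Running this over every channel before time-delay estimation reduces the noise that would otherwise bias the cross-correlation peaks used for speaker localization.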
