A study on joint beamforming and spectral enhancement for robust speech recognition in reverberant environments

This work evaluates multi-microphone beamforming and single-microphone spectral enhancement strategies for mitigating the effect of reverberation on automatic speech recognition (ASR) systems. The evaluation covers reverberant environments with different reverberation times (T60) and direct-to-reverberation ratios (DRRs). The systems combine minimum variance distortionless response (MVDR) beamformers with minimum mean square error (MMSE) estimators and late reverberation spectral variance (LRSV) estimators, the latter employing a generalized model of the room impulse response (RIR). Various system architectures are analyzed with a focus on optimal speech recognition performance. The system combining an MVDR beamformer with a subsequent MMSE estimator was found to give the best results, with relative reductions of 27.7% compared to the baseline system. This is attributed to a more accurate LRSV estimate from spatial averaging and diffuse field refinement for the MMSE estimator.
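
The best-performing architecture described above, a beamformer followed by a single-channel spectral gain, can be illustrated with a minimal sketch for one frequency bin. This is not the authors' implementation. The linear array geometry, the far-field steering vector, the diffuse noise coherence model, the diagonal loading value, and every function name below are assumptions made for illustration. A simple Wiener-like gain with a gain floor stands in for the MMSE estimator, and the late reverberant power is taken as given rather than estimated from an RIR model.

```python
import numpy as np

def steering_vector(freq_hz, mic_positions_m, angle_rad, c=343.0):
    # Assumed far-field source and microphones on a line (x-axis).
    delays = mic_positions_m * np.cos(angle_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def diffuse_coherence(freq_hz, mic_positions_m, c=343.0):
    # Spatial coherence of an ideal diffuse field: sin(2*pi*f*d/c) / (2*pi*f*d/c).
    # np.sinc(x) computes sin(pi*x) / (pi*x), hence the argument 2*f*d/c.
    dist = np.abs(mic_positions_m[:, None] - mic_positions_m[None, :])
    return np.sinc(2.0 * freq_hz * dist / c)

def mvdr_weights(d, gamma, loading=1e-2):
    # w = Gamma^{-1} d / (d^H Gamma^{-1} d), with diagonal loading for robustness.
    gamma_loaded = gamma + loading * np.eye(len(d))
    gamma_inv_d = np.linalg.solve(gamma_loaded, d)
    return gamma_inv_d / (d.conj() @ gamma_inv_d)

def spectral_gain(psd_input, psd_late, floor=0.1):
    # Wiener-like gain (a stand-in, not the MMSE estimator of the paper):
    # suppress the estimated late reverberant power, never below the floor.
    gain = 1.0 - psd_late / np.maximum(psd_input, 1e-12)
    return np.maximum(gain, floor)

# Hypothetical setup: 4 microphones, 5 cm spacing, broadside source, 1 kHz bin.
mics = np.arange(4) * 0.05
d = steering_vector(1000.0, mics, np.pi / 2)
w = mvdr_weights(d, diffuse_coherence(1000.0, mics))

# Beamformer output for one multichannel STFT bin x is w^H x; the
# single-channel gain is then applied to that output.
x = 0.8 * d
y = np.vdot(w, x)
gains = spectral_gain(np.array([1.0, 1.0]), np.array([0.5, 2.0]))
```

The distortionless constraint means that a signal arriving from the steered direction passes through the beamformer unchanged, so the single-channel stage only has to handle the residual reverberation at the beamformer output.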
