Robust coherence-based spectral enhancement for distant speech recognition

In this contribution to the 3rd CHiME Speech Separation and Recognition Challenge (CHiME-3), we extend the acoustic front-end of the CHiME-3 baseline speech recognition system with a coherence-based Wiener filter applied to the output signal of the baseline beamformer. To compute the time- and frequency-dependent postfilter gains, the ratio between direct and diffuse signal components at the beamformer output is estimated and used as an approximation of the short-time signal-to-noise ratio. The proposed spectral enhancement technique is evaluated in terms of the word error rate of the CHiME-3 baseline speech recognition system on real speech recorded in public environments. The results confirm the effectiveness of the coherence-based postfilter when integrated into the front-end signal enhancement.
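The core idea can be sketched in a few lines: treating the estimated coherent-to-diffuse ratio (CDR) as a short-time SNR estimate, the Wiener postfilter gain per time-frequency bin is G = CDR / (1 + CDR). The following is a minimal illustrative sketch, not the authors' implementation; the gain floor `g_min` and the placeholder CDR values are assumptions for demonstration only.

```python
import numpy as np

def wiener_gain_from_cdr(cdr, g_min=0.1):
    """Wiener postfilter gain per time-frequency bin.

    Treats the coherent-to-diffuse ratio (CDR) as a short-time SNR
    estimate, giving G = CDR / (1 + CDR). The floor g_min (an
    illustrative choice, not from the paper) limits musical noise.
    """
    gain = cdr / (1.0 + cdr)
    return np.maximum(gain, g_min)

# Toy example: apply the postfilter to one beamformer-output STFT frame.
rng = np.random.default_rng(0)
stft_frame = rng.standard_normal(257) + 1j * rng.standard_normal(257)
cdr = np.abs(rng.standard_normal(257))  # placeholder CDR estimate
enhanced = wiener_gain_from_cdr(cdr) * stft_frame
```

In practice the CDR would be estimated from the spatial coherence between microphone pairs rather than drawn at random, and the gain would be applied frame by frame to the beamformer-output STFT before resynthesis.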
