Spatial diffuseness features for DNN-based speech recognition in noisy and reverberant environments

We propose a spatial diffuseness feature for deep neural network (DNN)-based automatic speech recognition to improve recognition accuracy in reverberant and noisy environments. The feature is computed in real-time from multiple microphone signals without requiring knowledge or estimation of the direction of arrival, and represents the relative amount of diffuse noise in each time and frequency bin. It is shown that using the diffuseness feature as an additional input to a DNN-based acoustic model leads to a reduced word error rate for the REVERB challenge corpus, both compared to logmelspec features extracted from noisy signals, and features enhanced by spectral subtraction.

[1]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition , 2012 .

[2]  Steve Renals,et al.  Hybrid acoustic models for distant and multichannel large vocabulary speech recognition , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[3]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[4]  L. Danilenko Binaurales Hören im nichtstationären diffusen Schallfeld , 1969, Kybernetik.

[5]  Maja Taseska,et al.  The diffuse sound field in energetic analysis. , 2012, The Journal of the Acoustical Society of America.

[6]  Yongqiang Wang,et al.  An investigation of deep neural networks for noise robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Jan Cernocký,et al.  Improved feature processing for deep neural networks , 2013, INTERSPEECH.

[8]  Geoffrey Zweig,et al.  Recent advances in deep learning for speech research at Microsoft , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9]  Tara N. Sainath,et al.  Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups , 2012, IEEE Signal Processing Magazine.

[10]  John F Culling,et al.  Speech perception from monaural and binaural information. , 2006, The Journal of the Acoustical Society of America.

[11]  Walter Kellermann,et al.  Coherent-to-Diffuse Power Ratio Estimation for Dereverberation , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[12]  Xiaohui Zhang,et al.  Improving deep neural network acoustic models using generalized maxout networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  H S Colburn,et al.  Binaural sluggishness in the perception of tone sequences and speech in noise. , 2000, The Journal of the Acoustical Society of America.

[14]  Yuuki Tachioka,et al.  The MERL/MELCO/TUM system for the REVERB Challenge using Deep Recurrent Neural Network Feature Enhancement , 2014, ICASSP 2014.

[15]  Roland Maas,et al.  AT wo-Channel Acoustic Front-End for Robust Automatic Speech Recognition in Noisy and Reverberant Environments , 2011 .

[16]  Thomas Hain,et al.  Using neural network front-ends on far field multiple microphones based speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Masakiyo Fujimoto,et al.  LINEAR PREDICTION-BASED DEREVERBERATION WITH ADVANCED SPEECH ENHANCEMENT AND RECOGNITION TECHNOLOGIES FOR THE REVERB CHALLENGE , 2014 .

[18]  O. Thiergart,et al.  Coherence-based diffuseness estimation in the spherical harmonic domain , 2012, 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel.

[19]  Walter Kellermann,et al.  Unbiased coherent-to-diffuse ratio estimation for dereverberation , 2014, 2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC).

[20]  Ramón Fernández Astudillo,et al.  Integration of beamforming and uncertainty-of-observation techniques for robust ASR in multi-source environments , 2013, Comput. Speech Lang..

[21]  DeLiang Wang,et al.  Joint noise adaptive training for robust automatic speech recognition , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Emanuel A. P. Habets,et al.  Power-based signal-to-diffuse ratio estimation using noisy directional microphones , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Tomohiro Nakatani,et al.  The reverb challenge: A common evaluation framework for dereverberation and recognition of reverberant speech , 2013, 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

[24]  Tomohiro Nakatani,et al.  Is speech enhancement pre-processing still relevant when using deep neural networks for acoustic modeling? , 2013, INTERSPEECH.