A reverberation-time-aware DNN approach leveraging spatial information for microphone array dereverberation

A reverberation-time-aware deep-neural-network (DNN)-based multi-channel speech dereverberation framework is proposed to handle a wide range of reverberation times (RT60s). The design of a robust system involves three key steps. First, to accomplish speech dereverberation and beamforming simultaneously, we propose a framework, called DNNSpatial, that selectively concatenates log-power spectral (LPS) input features of reverberant speech from multiple microphones in an array and maps them to the expected output LPS features of anechoic reference speech using a single DNN. Next, the temporal auto-correlation function of the received signals at different RT60s is investigated to show that RT60-dependent temporal-spatial contexts are needed for feature selection in the DNNSpatial training stage in order to optimize system performance in diverse reverberant environments. Finally, the RT60 is estimated and used to select the proper temporal and spatial contexts before the LPS features are fed to the trained DNNs for speech dereverberation. The experimental evidence gathered in this study indicates that the proposed framework outperforms both the weighted prediction error (WPE) algorithm, a state-of-the-art signal-processing dereverberation method, and conventional DNNSpatial systems that do not take the reverberation time into account, even under extremely weak and extremely severe reverberation. The proposed technique generalizes well to unseen room sizes, array geometries and loudspeaker positions, and is robust to reverberation time estimation errors.
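The feature-construction step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the frame and hop sizes, and in particular the RT60 thresholds and context sizes in `select_context` are hypothetical placeholders chosen only to show the mechanics. The abstract does not state how the estimated RT60 maps to the temporal context (neighbouring frames) or the spatial context (number of microphones).

```python
import numpy as np

def log_power_spectrum(x, frame_len=512, hop=256, eps=1e-10):
    """Return LPS features of shape (n_frames, frame_len // 2 + 1)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # Log of the squared STFT magnitude; eps avoids log(0).
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) ** 2 + eps)

def select_context(rt60_s):
    """Hypothetical lookup: RT60 in seconds -> (frames per side, microphones).

    The thresholds and values are arbitrary placeholders, not taken from the paper.
    """
    if rt60_s < 0.3:
        return 1, 4
    if rt60_s < 0.7:
        return 3, 2
    return 5, 1

def stack_features(lps, half_context, n_mics):
    """Concatenate LPS frames over time context and microphones.

    lps: array of shape (n_total_mics, n_frames, n_bins).
    Returns (n_frames, n_mics * (2 * half_context + 1) * n_bins).
    """
    lps = lps[:n_mics]
    n_frames = lps.shape[1]
    # Repeat edge frames so every frame has a full context window.
    padded = np.pad(lps, ((0, 0), (half_context, half_context), (0, 0)),
                    mode="edge")
    shifted = [padded[:, t:t + n_frames, :]
               for t in range(2 * half_context + 1)]
    # (context, mics, frames, bins) -> (frames, mics, context, bins)
    stacked = np.stack(shifted).transpose(2, 1, 0, 3)
    return stacked.reshape(n_frames, -1)

# Usage with synthetic 4-microphone data standing in for reverberant speech.
rng = np.random.default_rng(0)
mics = rng.standard_normal((4, 16000))
lps = np.stack([log_power_spectrum(ch) for ch in mics])
half_context, n_mics = select_context(0.5)  # 0.5 s stands in for an estimated RT60
features = stack_features(lps, half_context, n_mics)
```

Each row of `features` would serve as one input vector to a regression DNN whose training target is the LPS frame of the anechoic reference speech. The network itself and the RT60 estimator are outside the scope of this sketch.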
