Performance comparison of real-time single-channel speech dereverberation algorithms

This paper investigates four single-channel speech dereverberation algorithms: two unsupervised approaches based on (i) spectral enhancement and (ii) linear prediction, and two supervised machine-learning approaches that use deep neural networks to predict either (iii) the magnitude spectrogram or (iv) the ideal ratio mask. The relative merits of the four algorithms are discussed in terms of several objective measures, automatic speech recognition performance, robustness against noise, behavior on simulated versus recorded reverberant speech, computation time, and latency. Experimental results show that all four algorithms provide benefits in reverberant environments, even in the presence of moderate background noise. In addition, their low computational complexity and latency indicate their potential for real-time applications.
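As a brief illustration of the supervised target in (iv), the following sketch shows a common definition of the ideal ratio mask and how such a mask would be applied to a mixture magnitude spectrogram. The function names, the `beta` exponent, and the flooring constant are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def ideal_ratio_mask(target_mag, interference_mag, beta=0.5):
    """One common IRM definition: (S^2 / (S^2 + N^2))^beta per T-F unit.

    Here S is the target (e.g. direct-path speech) magnitude and N the
    interference (e.g. late reverberation plus noise) magnitude; beta=0.5
    is a frequently used compression exponent (an assumption here).
    """
    s2 = target_mag ** 2
    n2 = interference_mag ** 2
    return (s2 / (s2 + n2 + 1e-12)) ** beta  # small floor avoids 0/0

def apply_mask(mixture_mag, mask):
    # Element-wise masking of the mixture magnitude spectrogram;
    # the mixture phase is typically reused for resynthesis.
    return mixture_mag * mask
```

In a supervised system, a network is trained to predict this mask from features of the reverberant input; at test time the predicted mask plays the role of `mask` in `apply_mask`, since the clean target is unavailable.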
