Neural Network-Based Spectrum Estimation for Online WPE Dereverberation

In this paper, we propose a novel speech dereverberation framework that uses deep neural network (DNN)-based spectrum estimation to construct linear inverse filters. The framework builds on the state-of-the-art weighted prediction error (WPE) inverse filter estimation algorithm, which is known to effectively reduce reverberation and greatly boost automatic speech recognition (ASR) performance under various conditions. In WPE, the accuracy of the inverse filter estimation, and thus the dereverberation performance, depends largely on the estimate of the power spectral density (PSD) of the target signal. Conventional WPE therefore iterates over inverse filter estimation, dereverberation, and PSD estimation to gradually refine the PSD estimate. While this iterative procedure works well when sufficiently long, acoustically stationary observations are available, WPE's performance degrades when the duration of the observed/accessible data is short, as is typically the case in real-time applications that use online block-batch processing with small batches. To solve this problem, we incorporate a DNN-based spectrum estimator into the WPE framework, because a DNN can estimate the PSD robustly even from very short observations. We experimentally show that the proposed framework outperforms conventional WPE and improves ASR performance in real noisy reverberant environments in both single-channel and multichannel cases.
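The WPE recursion described above can be sketched as follows for the simplest setting: a single channel and a single frequency bin, with the conventional iterative PSD refinement. This is a minimal illustration, not the paper's implementation; the tap count, prediction delay, relative PSD flooring, and the toy single-echo signal are all illustrative assumptions, and the DNN spectrum estimator itself is omitted (its output would simply replace the PSD update inside the loop).

```python
import numpy as np

def floor_psd(p, ratio=0.1):
    """Practical relative flooring to keep the weights 1/lambda(t) bounded."""
    return np.maximum(p, ratio * p.mean())

def wpe_dereverb_bin(y, psd, taps=10, delay=3, eps=1e-8):
    """Dereverberate one STFT frequency bin with WPE (minimal sketch).

    y   : complex observed frames for one bin, shape (T,)
    psd : estimate of the target-signal PSD lambda(t), shape (T,)
    """
    T = len(y)
    # Stack delayed frames: Y[k, t] = y(t - delay - k)
    Y = np.zeros((taps, T), dtype=complex)
    for k in range(taps):
        shift = delay + k
        Y[k, shift:] = y[:T - shift]
    w = 1.0 / np.maximum(psd, eps)                   # per-frame weights 1/lambda(t)
    R = (Y * w) @ Y.conj().T                         # weighted correlation matrix
    p = (Y * w) @ y.conj()                           # weighted cross-correlation vector
    g = np.linalg.solve(R + eps * np.eye(taps), p)   # linear prediction filter
    return y - g.conj() @ Y                          # dereverberated frames d(t)

# Toy example: white target signal plus a single late echo at lag 3.
rng = np.random.default_rng(0)
T = 2000
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)   # target signal
y = s.copy()
y[3:] += 0.5 * s[:-3]                                       # late reverberation

# Conventional WPE alternates filter estimation, dereverberation, and PSD
# refinement; the proposed method would replace the PSD update below with
# a DNN forward pass on the (possibly very short) observation.
psd = floor_psd(np.abs(y) ** 2)                             # initial PSD estimate
for _ in range(3):
    d = wpe_dereverb_bin(y, psd)
    psd = floor_psd(np.abs(d) ** 2)                         # refine PSD from output
```

Because the prediction delay skips the first few taps, the filter models only the late reverberation and leaves the direct signal largely intact; on the toy signal above, the distortion `d - s` comes out much smaller than the original reverberant tail `y - s`.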
