Integrating Neural Network Based Beamforming and Weighted Prediction Error Dereverberation

The weighted prediction error (WPE) algorithm has proven to be a highly successful dereverberation method in the REVERB challenge. Likewise, neural network based mask estimation for beamforming demonstrated very good noise suppression in the CHiME 3 and CHiME 4 challenges. Recently, it has been shown that this estimator can also be trained to perform dereverberation and denoising jointly. However, a direct comparison of a neural beamformer and WPE has been missing so far, as has an investigation of their combination. We therefore provide an extensive evaluation of both approaches and propose variants that integrate deep neural network based beamforming with WPE. For these integrated variants we observe a consistent word error rate (WER) reduction on two distinct databases. In particular, our study shows that deep learning based beamforming benefits from a model-based dereverberation technique (i.e., WPE) and vice versa. Our key findings are: (a) the WER advantage of neural beamforming over WPE grows with the number of channels and the amount of noise; (b) integrating WPE with a neural beamformer consistently outperforms all stand-alone systems.
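To make the processing chain concrete, below is a minimal NumPy/SciPy sketch of the two building blocks discussed above: an iterative WPE filter for a single frequency band, and a mask-based GEV beamformer of the kind typically driven by a neural mask estimator. The function names, the default settings (taps, delay, iterations), and the choice of the GEV criterion are illustrative assumptions rather than the paper's exact implementation; the neural mask estimator itself is left as an external input providing speech_mask and noise_mask.

import numpy as np
from scipy.linalg import eigh

def wpe_single_band(Y, taps=10, delay=3, iterations=3, eps=1e-10):
    # Iterative WPE dereverberation for one frequency band (illustrative sketch).
    # Y: (D, T) complex STFT of D channels; returns the dereverberated (D, T) signal.
    D, T = Y.shape
    X = Y.copy()
    # Stacked, delayed observations Y_tilde: (D * taps, T)
    Y_tilde = np.zeros((D * taps, T), dtype=Y.dtype)
    for k in range(taps):
        shift = delay + k
        Y_tilde[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]
    for _ in range(iterations):
        # Time-varying power of the current estimate, averaged over channels
        lam = np.maximum(np.mean(np.abs(X) ** 2, axis=0), eps)   # (T,)
        R = (Y_tilde / lam) @ Y_tilde.conj().T                   # weighted correlation matrix
        P = (Y_tilde / lam) @ Y.conj().T                         # weighted cross-correlation
        G = np.linalg.solve(R, P)                                # multichannel prediction filters
        X = Y - G.conj().T @ Y_tilde                             # subtract predicted late reverberation
    return X

def gev_beamformer(X, speech_mask, noise_mask, eps=1e-10):
    # Mask-based GEV beamforming for one frequency band (illustrative sketch).
    # X: (D, T) complex STFT; masks: (T,) values in [0, 1]; returns the (T,) beamformed signal.
    Phi_xx = (X * speech_mask) @ X.conj().T / max(speech_mask.sum(), eps)
    Phi_nn = (X * noise_mask) @ X.conj().T / max(noise_mask.sum(), eps)
    Phi_nn = Phi_nn + eps * np.eye(X.shape[0])                   # regularize the noise covariance
    # The principal generalized eigenvector maximizes the output SNR
    _, vecs = eigh(Phi_xx, Phi_nn)
    w = vecs[:, -1]
    return w.conj() @ X

In a full pipeline both functions would be applied independently to every frequency bin of the multichannel STFT, with WPE preceding the beamformer as in the integrated variants described above; the masks would come from a neural estimator, and a postfilter such as blind analytic normalization is commonly added after GEV to compensate for its arbitrary scaling.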
