Neural Network Based Time-Frequency Masking and Steering Vector Estimation for Two-Channel MVDR Beamforming

We present a neural network based approach to two-channel beamforming. First, single- and cross-channel spectral features are extracted to form a feature map for each utterance. A large neural network, a concatenation of a convolutional neural network (CNN), a long short-term memory recurrent neural network (LSTM-RNN), and a deep neural network (DNN), is then employed to estimate frame-level speech and noise masks. These predicted masks are used to compute cross-power spectral density (CPSD) matrices, from which the minimum variance distortionless response (MVDR) beamformer coefficients are estimated. Finally, a DNN is trained to optimize the phase of the estimated steering vectors, making the system robust in reverberant conditions. We compare our method with two state-of-the-art two-channel speech enhancement systems: time-frequency masking and masking-based beamforming. Results show that the proposed method yields a 21% relative improvement in word error rate (WER) over these systems.
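As a rough illustration of the mask-to-beamformer stage described above, the sketch below computes mask-weighted CPSD matrices and the standard MVDR solution in NumPy. The function name is ours, the steering vector is taken as the principal eigenvector of the speech CPSD (a common choice, not necessarily the paper's), and the DNN-based phase refinement of the steering vector is omitted.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, eps=1e-10):
    """Mask-based MVDR beamforming (illustrative sketch).

    Y:           (F, T, C) complex STFT of the multi-channel mixture.
    speech_mask: (F, T) speech time-frequency mask in [0, 1].
    noise_mask:  (F, T) noise time-frequency mask in [0, 1].
    Returns the enhanced single-channel STFT of shape (F, T).
    """
    n_freq, n_frames, n_chan = Y.shape
    X_hat = np.empty((n_freq, n_frames), dtype=complex)
    for f in range(n_freq):
        Yf = Y[f]  # (T, C)
        # Outer products y(t) y(t)^H for every frame: shape (T, C, C)
        outer = Yf[:, :, None] * Yf[:, None, :].conj()
        # Mask-weighted CPSD matrices for speech and noise
        phi_s = (speech_mask[f, :, None, None] * outer).sum(0)
        phi_s /= speech_mask[f].sum() + eps
        phi_n = (noise_mask[f, :, None, None] * outer).sum(0)
        phi_n /= noise_mask[f].sum() + eps
        # Steering vector: principal eigenvector of the speech CPSD
        _, eigvecs = np.linalg.eigh(phi_s)
        d = eigvecs[:, -1]
        # MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
        phi_n_inv = np.linalg.inv(phi_n + eps * np.eye(n_chan))
        w = phi_n_inv @ d / (d.conj() @ phi_n_inv @ d + eps)
        # Beamformer output: x_hat(t) = w^H y(t)
        X_hat[f] = Yf @ w.conj()
    return X_hat
```

The per-frequency loop mirrors the narrowband formulation of MVDR; the diagonal loading term `eps * np.eye(n_chan)` is added only to keep the noise CPSD invertible in this toy setting.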
