Neural Network Based Time-Frequency Masking and Steering Vector Estimation for Two-Channel MVDR Beamforming

We present a neural network based approach to two-channel beamforming. First, single- and cross-channel spectral features are extracted to form a feature map for each utterance. A large neural network, a concatenation of a convolutional neural network (CNN), a long short-term memory recurrent neural network (LSTM-RNN), and a deep neural network (DNN), is then employed to estimate frame-level speech and noise masks. These predicted masks are used to compute cross-power spectral density (CPSD) matrices, from which the minimum variance distortionless response (MVDR) beamformer coefficients are estimated. Finally, a DNN is trained to optimize the phase of the estimated steering vectors, making the system robust in reverberant conditions. We compare our method with two state-of-the-art two-channel speech enhancement systems: time-frequency masking and masking-based beamforming. Results show that the proposed method yields a 21% relative improvement in word error rate (WER) over these systems.
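As a rough illustration of the mask-to-beamformer stage described above, the sketch below computes mask-weighted CPSD matrices and the standard MVDR solution in NumPy. The function name is ours, the steering vector is taken as the principal eigenvector of the speech CPSD (a common choice, not necessarily the paper's), and the DNN-based phase refinement of the steering vector is omitted.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, eps=1e-10):
    """Mask-based MVDR beamforming (illustrative sketch).

    Y:           (F, T, C) complex STFT of the multi-channel mixture.
    speech_mask: (F, T) speech time-frequency mask in [0, 1].
    noise_mask:  (F, T) noise time-frequency mask in [0, 1].
    Returns the enhanced single-channel STFT of shape (F, T).
    """
    n_freq, n_frames, n_chan = Y.shape
    X_hat = np.empty((n_freq, n_frames), dtype=complex)
    for f in range(n_freq):
        Yf = Y[f]  # (T, C)
        # Outer products y(t) y(t)^H for every frame: shape (T, C, C)
        outer = Yf[:, :, None] * Yf[:, None, :].conj()
        # Mask-weighted CPSD matrices for speech and noise
        phi_s = (speech_mask[f, :, None, None] * outer).sum(0)
        phi_s /= speech_mask[f].sum() + eps
        phi_n = (noise_mask[f, :, None, None] * outer).sum(0)
        phi_n /= noise_mask[f].sum() + eps
        # Steering vector: principal eigenvector of the speech CPSD
        _, eigvecs = np.linalg.eigh(phi_s)
        d = eigvecs[:, -1]
        # MVDR weights: w = Phi_n^{-1} d / (d^H Phi_n^{-1} d)
        phi_n_inv = np.linalg.inv(phi_n + eps * np.eye(n_chan))
        w = phi_n_inv @ d / (d.conj() @ phi_n_inv @ d + eps)
        # Beamformer output: x_hat(t) = w^H y(t)
        X_hat[f] = Yf @ w.conj()
    return X_hat
```

The per-frequency loop mirrors the narrowband formulation of MVDR; the diagonal loading term `eps * np.eye(n_chan)` is added only to keep the noise CPSD invertible in this toy setting.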
