DNN-based speech mask estimation for eigenvector beamforming

In this paper, we present an optimal multi-channel Wiener filter, which consists of an eigenvector beamformer and a single-channel postfilter. We show that both components solely depend on a speech presence probability, which we learn using a deep neural network, consisting of a deep autoencoder and a softmax regression layer. To prevent the DNN from learning specific speaker and noise types, we do not use the signal energy as input feature, but rather the cosine distance between the dominant eigenvectors of consecutive frames of the power spectral density of the noisy speech signal. We compare our system against the BeamformIt toolkit, and state-of-the-art approaches such as the front-end of the best system of the CHiME3 challenge. We show that our system yields superior results, both in terms of perceptual speech quality and classification error.

[1]  Thomas Hofmann,et al.  Greedy Layer-Wise Training of Deep Networks , 2007 .

[2]  Franz Pernkopf,et al.  Multi-channel speech processing architectures for noise robust speech recognition: 3rd CHiME challenge results , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[3]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[4]  DeLiang Wang,et al.  Cocktail Party Processing via Structured Prediction , 2012, NIPS.

[5]  Emmanuel Vincent,et al.  Improved Perceptual Metrics for the Evaluation of Audio Source Separation , 2012, LVA/ICA.

[6]  Reinhold Häb-Umbach,et al.  Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7]  Geoffrey E. Hinton,et al.  Binary coding of speech spectrograms using a deep auto-encoder , 2010, INTERSPEECH.

[8]  Franz Pernkopf,et al.  Single channel source separation with general stochastic networks , 2014, INTERSPEECH.

[9]  Reinhold Häb-Umbach,et al.  Neural network based spectral mask estimation for acoustic beamforming , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10]  Gerald Enzner,et al.  State-space architecture of the partitioned-block-based acoustic echo controller , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Björn W. Schuller,et al.  Discriminatively trained recurrent neural networks for single-channel speech separation , 2014, 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP).

[12]  S. Gannot,et al.  Speech enhancement based on the general transfer function GSC and postfiltering , 2004, IEEE Trans. Speech Audio Process..

[13]  Takuya Yoshioka,et al.  Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Ehud Weinstein,et al.  Signal enhancement using beamforming and nonstationarity with applications to speech , 2001, IEEE Trans. Signal Process..

[15]  Methods for objective and subjective assessment of quality Perceptual evaluation of speech quality ( PESQ ) : An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs , 2002 .

[16]  Jonathan Le Roux,et al.  Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks , 2016, INTERSPEECH.

[17]  Emmanuel Vincent,et al.  Subjective and Objective Quality Assessment of Audio Source Separation , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[18]  Franz Pernkopf,et al.  Eigenvector-Based Speech Mask Estimation for Multi-Channel Speech Enhancement , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[19]  Walter Kellermann,et al.  Analysis of blocking matrices for generalized sidelobe cancellers for non-stationary broadband signals , 2002, ICASSP.

[20]  Reinhold Häb-Umbach,et al.  Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[21]  Akihiko Sugiyama,et al.  A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters , 1999, IEEE Trans. Signal Process..

[22]  Franz Pernkopf,et al.  Blind source extraction based on a direction-dependent a-priori SNR , 2014, INTERSPEECH.

[23]  Israel Cohen,et al.  A sparse blocking matrix for multiple constraints GSC beamformer , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Jon Barker,et al.  The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[25]  Franz Pernkopf,et al.  Representation Learning for Single-Channel Source Separation and Bandwidth Extension , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.