Mask-based MVDR Beamformer for Noisy Multisource Environments: Introduction of Time-varying Spatial Covariance Model

This paper proposes a method for designing a time-varying minimum variance distortionless response (MVDR) beamformer using time-frequency masks, with the aim of improving speech enhancement in noisy multi-speaker environments. A key to successful beamforming is to estimate accurately a time-varying spatial covariance matrix (SCM) for noise composed of both stationary diffuse noise and highly time-varying speech. For this purpose, we introduce a stochastic model that can represent the time-varying characteristics of the noise SCM, and derive a method for estimating a time-varying noise SCM based on the model. Experiments show that the proposed method can substantially improve the performance of the beamformer in terms of automatic speech recognition (ASR) accuracy and source-to-distortion ratio compared with a conventional time-invariant MVDR beamformer.

[1]  Takuya Yoshioka,et al.  Exploring Practical Aspects of Neural Mask-Based Beamforming for Far-Field Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Reinhold Häb-Umbach,et al.  Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings , 2017, INTERSPEECH.

[3]  Jonathan Le Roux,et al.  Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks , 2016, INTERSPEECH.

[4]  Rémi Gribonval,et al.  Performance measurement in blind audio source separation , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[5]  Tomohiro Nakatani,et al.  Spatial correlation model based observation vector clustering and MVDR beamforming for meeting recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  DeLiang Wang,et al.  On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis , 2005, Speech Separation by Humans and Machines.

[7]  Reinhold Häb-Umbach,et al.  Neural network based spectral mask estimation for acoustic beamforming , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Tomohiro Nakatani,et al.  Permutation-Free Cgmm: Complex Gaussian Mixture Model with Inverse Wishart Mixture Model Based Spatial Prior for Permutation-Free Source Separation and Source Counting , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Chengzhu Yu,et al.  The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[10]  Hiroshi Sawada,et al.  Underdetermined Convolutive Blind Source Separation via Frequency Bin-Wise Clustering and Permutation Alignment , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Sharon Gannot,et al.  Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Scott Rickard,et al.  Blind separation of speech mixtures via time-frequency masking , 2004, IEEE Transactions on Signal Processing.

[13]  Tara N. Sainath,et al.  Performance of Mask Based Statistical Beamforming in a Smart Home Scenario , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Rémi Gribonval,et al.  Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[15]  Jacob Benesty,et al.  An Integrated Solution for Online Multichannel Noise Tracking and Reduction , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[16]  Shinji Watanabe,et al.  Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline , 2018, INTERSPEECH.

[17]  Janet M. Baker,et al.  The Design for the Wall Street Journal-based CSR Corpus , 1992, HLT.

[18]  Zhuo Chen,et al.  Deep clustering: Discriminative embeddings for segmentation and separation , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Tomohiro Nakatani,et al.  Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Tomohiro Nakatani,et al.  Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Dong Yu,et al.  Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[22]  Tomohiro Nakatani,et al.  Listening to Each Speaker One by One with Recurrent Selective Hearing Networks , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Takuya Yoshioka,et al.  Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Hiroshi Sawada,et al.  A Multichannel MMSE-Based Framework for Speech Source Separation and Noise Reduction , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[25]  Reinhold Häb-Umbach,et al.  Blind speech separation employing directional statistics in an Expectation Maximization framework , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[26]  Jon Barker,et al.  An analysis of environment, microphone and data simulation mismatches in robust speech recognition , 2017, Comput. Speech Lang..

[27]  Hiroshi Sawada,et al.  Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors , 2007, Signal Process..