DNN-supported Mask-based Convolutional Beamforming for Simultaneous Denoising, Dereverberation, and Source Separation

In this article, we investigate an integrated mask-based convolutional beamforming method for performing simultaneous denoising, dereverberation, and source separation. Conventionally, it is difficult for neural network-supported mask-based source separation to perform denoising and dereverberation at the same time and for spatial clustering-based source separation to reliably solve the permutation problem in the presence of noise and reverberation. This greatly limits the application of mask-based source separation. To address this issue, we propose a method to integrate state-of-the-art techniques for mask-based beamforming into a single optimization framework. These techniques include frequency-domain Convolutional Neural Network based utterance-level Permutation Invariant Training with a large receptive field (CNN-uPIT), noisy Complex Gaussian Mixture Model based spatial clustering (noisyCGMM), and Weighted Power minimization Distortionless response (WPD) convolutional beamforming. Our experiments show that all these components are essential for accurately estimating desired speech signals in noisy reverberant multisource environments.

[1]  Chengzhu Yu,et al.  The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[2]  Tomohiro Nakatani,et al.  Jointly Optimal Dereverberation and Beamforming , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Dong Yu,et al.  Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[4]  Özgür Yilmaz,et al.  Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[5]  Naoyuki Kanda,et al.  Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR , 2019, INTERSPEECH.

[6]  Steve Renals,et al.  WSJCAMO: a British English speech corpus for large vocabulary continuous speech recognition , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[7]  Jonathan Le Roux,et al.  Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Daniel Povey,et al.  The Kaldi Speech Recognition Toolkit , 2011 .

[9]  B.D. Van Veen,et al.  Beamforming: a versatile approach to spatial filtering , 1988, IEEE ASSP Magazine.

[10]  Hakan Erdogan,et al.  Multi-Microphone Neural Speech Separation for Far-Field Multi-Talker Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Masahito Togami,et al.  Local Gaussian model with source-set constraints in audio source separation , 2017, 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).

[12]  Tomohiro Nakatani,et al.  All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  Tomohiro Nakatani,et al.  Integrating DNN-based and spatial clustering-based mask estimation for robust MVDR beamforming , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Biing-Hwang Juang,et al.  Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[15]  Dong Yu,et al.  Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16]  Hyung-Min Park,et al.  A Beamforming Algorithm Based on Maximum Likelihood of a Complex Gaussian Distribution With Time-Varying Variances for Robust Speech Recognition , 2019, IEEE Signal Processing Letters.

[17]  J. Foote,et al.  WSJCAM0: A BRITISH ENGLISH SPEECH CORPUS FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION , 1995 .

[18]  Zhong-Qiu Wang,et al.  Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19]  Reinhold Häb-Umbach,et al.  A generic neural acoustic beamforming architecture for robust multi-channel speech processing , 2017, Comput. Speech Lang..

[20]  Jon Barker,et al.  The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[21]  Nima Mesgarani,et al.  Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[22]  Tomohiro Nakatani,et al.  Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Tomohiro Nakatani,et al.  Maximum likelihood convolutional beamformer for simultaneous denoising and dereverberation , 2019, 2019 27th European Signal Processing Conference (EUSIPCO).

[24]  Reinhold Häb-Umbach,et al.  Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Reinhold Häb-Umbach,et al.  Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings , 2017, INTERSPEECH.

[26]  Hiroshi Sawada,et al.  Simultaneous Denoising, Dereverberation, and Source Separation Using a Unified Convolutional Beamformer , 2019, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[27]  R. Maas,et al.  A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research , 2016, EURASIP Journal on Advances in Signal Processing.

[28]  Tomohiro Nakatani,et al.  Noisy cGMM: Complex Gaussian Mixture Model with Non-Sparse Noise Model for Joint Source Separation and Denoising , 2018, 2018 26th European Signal Processing Conference (EUSIPCO).

[29]  Hiroshi Sawada,et al.  Underdetermined Convolutive Blind Source Separation via Frequency Bin-Wise Clustering and Permutation Alignment , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[30]  Sharon Gannot,et al.  Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Tomohiro Nakatani,et al.  A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation , 2018, IEEE Signal Processing Letters.

[32]  Tomohiro Nakatani,et al.  Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[33]  Yong Xu,et al.  A comprehensive study of speech separation: spectrogram vs waveform separation , 2019, INTERSPEECH.