Blind Speech Separation and Dereverberation using Neural Beamforming

In this paper, we present the Blind Speech Separation and Dereverberation (BSSD) network, which performs simultaneous speaker separation, dereverberation and speaker identification in a single neural network. Speaker separation is guided by a set of predefined spatial cues. Dereverberation is performed by using neural beamforming, and speaker identification is aided by embedding vectors and triplet mining. We introduce a frequencydomain model which uses complex-valued neural networks, and a time-domain variant which performs beamforming in latent space. Further, we propose a block-online mode to process longer audio recordings, as they occur in meeting scenarios. We evaluate our system in terms of Scale Independent Signal to Distortion Ratio (SI-SDR), Word Error Rate (WER) and Equal Error Rate (EER).

[1]  Nima Mesgarani,et al.  Deep attractor network for single-microphone speaker separation , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Robert F. H. Fischer,et al.  Precoding and Signal Shaping for Digital Transmission , 2002 .

[3]  Reinhold Häb-Umbach,et al.  Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[4]  Xiao Liu,et al.  Deep Speaker: an End-to-End Neural Speaker Embedding System , 2017, ArXiv.

[5]  Nima Mesgarani,et al.  TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6]  DeLiang Wang,et al.  Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Franz Pernkopf,et al.  Eigenvector-Based Speech Mask Estimation for Multi-Channel Speech Enhancement , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[8]  Franz Pernkopf,et al.  Deep Complex-valued Neural Beamformers , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Shinji Watanabe,et al.  Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge , 2018, INTERSPEECH.

[10]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[11]  Ivan Dokmanic,et al.  Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Tomohiro Nakatani,et al.  All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[13]  DeLiang Wang,et al.  Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[14]  Tomohiro Nakatani,et al.  Neural Network-Based Spectrum Estimation for Online WPE Dereverberation , 2017, INTERSPEECH.

[15]  Martti Vainio,et al.  Proceedings of the Annual Conference of the International Speech Communication Association , 2016, Interspeech 2016.

[16]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Jon Barker,et al.  The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines , 2015, 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

[18]  P. Bouboulis,et al.  Wirtinger's Calculus in general Hilbert Spaces , 2010, ArXiv.

[19]  Tomohiro Nakatani,et al.  SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures , 2019, IEEE Journal of Selected Topics in Signal Processing.

[20]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[21]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  E. Habets,et al.  Generating sensor signals in isotropic noise fields. , 2007, The Journal of the Acoustical Society of America.

[23]  Zhuo Chen,et al.  Deep clustering: Discriminative embeddings for segmentation and separation , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Reinhold Häb-Umbach,et al.  Neural network based spectral mask estimation for acoustic beamforming , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Tomohiro Nakatani,et al.  DNN-supported Mask-based Convolutional Beamforming for Simultaneous Denoising, Dereverberation, and Source Separation , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[26]  Chong Wang,et al.  How to Train Triplet Networks with 100K Identities? , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[27]  Jonathan Le Roux,et al.  SDR – Half-baked or Well Done? , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28]  Jonathan Le Roux,et al.  Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks , 2016, INTERSPEECH.

[29]  Dong Yu,et al.  Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[30]  Derrick G. Kourie,et al.  Numerical verification of bidirectional reflectance distribution functions for physical plausibility , 2013, SAICSIT '13.

[31]  Hakan Erdogan,et al.  Multi-Microphone Neural Speech Separation for Far-Field Multi-Talker Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[32]  Simon Dixon,et al.  Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation , 2018, ISMIR.

[33]  Zhong-Qiu Wang,et al.  Multi-Channel Deep Clustering: Discriminative Spectral and Spatial Embeddings for Speaker-Independent Speech Separation , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[34]  Franz Pernkopf,et al.  Eigenvector-Based Speech Mask Estimation Using Logistic Regression , 2017, INTERSPEECH.

[35]  Tomohiro Nakatani,et al.  Speaker-Aware Neural Network Based Beamformer for Speaker Extraction in Speech Mixtures , 2017, INTERSPEECH.

[36]  Bhiksha Raj,et al.  W-Net BF: DNN-based Beamformer Using Joint Training Approach , 2019, ArXiv.

[37]  Jon Barker,et al.  CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings , 2020, 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).

[38]  Tomohiro Nakatani,et al.  A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation , 2018, IEEE Signal Processing Letters.

[39]  Tomohiro Nakatani,et al.  Making Machines Understand Us in Reverberant Rooms: Robustness Against Reverberation for Automatic Speech Recognition , 2012, IEEE Signal Process. Mag..

[40]  B.D. Van Veen,et al.  Beamforming: a versatile approach to spatial filtering , 1988, IEEE ASSP Magazine.

[41]  Terrence J. Sejnowski,et al.  The “independent components” of natural scenes are edge filters , 1997, Vision Research.

[42]  Tomohiro Nakatani,et al.  Beam-TasNet: Time-domain Audio Separation Network Meets Frequency-domain Beamformer , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[43]  Visar Berisha,et al.  Triplet Network with Attention for Speaker Diarization , 2018, INTERSPEECH.

[44]  Xiong Xiao,et al.  Low-latency Speaker-independent Continuous Speech Separation , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[45]  Arun Ross,et al.  Microphone Arrays , 2009, Encyclopedia of Biometrics.

[46]  Sandeep Subramanian,et al.  Deep Complex Networks , 2017, ICLR.

[47]  Zhong-Qiu Wang,et al.  Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation , 2018, INTERSPEECH.

[48]  Jonathan Le Roux,et al.  MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[49]  Patrick Kenny,et al.  Front-End Factor Analysis for Speaker Verification , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[50]  Sanjeev Khudanpur,et al.  Deep neural network-based speaker embeddings for end-to-end speaker verification , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[51]  Jingdong Chen,et al.  Acoustic MIMO Signal Processing , 2006 .

[52]  Shih-Fu Chang,et al.  Learning Spread-Out Local Feature Descriptors , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[53]  John H. L. Hansen,et al.  Text-Independent Speaker Verification Based on Triplet Convolutional Neural Network Embeddings , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[54]  Nima Mesgarani,et al.  Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[55]  Jon Barker,et al.  The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines , 2018, INTERSPEECH.

[56]  Xiong Xiao,et al.  Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks , 2018, INTERSPEECH.

[57]  Alexander J. Smola,et al.  Sampling Matters in Deep Embedding Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[58]  Kazuyuki Murase,et al.  Wirtinger Calculus Based Gradient Descent and Levenberg-Marquardt Learning Algorithms in Complex-Valued Neural Networks , 2011, ICONIP.

[59]  Jesper Jensen,et al.  Permutation invariant training of deep models for speaker-independent multi-talker speech separation , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).