Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement

This work introduces sequential neural beamforming, which alternates between neural-network-based spectral separation and beamforming-based spatial separation. Our separation networks use an advanced convolutional architecture trained with a novel stabilized signal-to-noise ratio loss function. For beamforming, we explore multiple ways of computing time-varying covariance matrices, including factorizing the spatial covariance into a time-varying amplitude component and a time-invariant spatial component, as well as block-based techniques. In addition, we introduce a multi-frame beamforming method, which significantly improves results by adding contextual frames to the beamforming formulations. We extensively evaluate and analyze the effects of window size, block size, and multi-frame context size for these methods. Our best method uses a sequence of three neural separation and multi-frame time-invariant spatial beamforming stages, and demonstrates an average improvement of 2.75 dB in scale-invariant signal-to-noise ratio and a 14.2% absolute reduction in a comparative speech recognition metric across four challenging reverberant speech enhancement and separation tasks. We also use our three-speaker separation model to separate real recordings from the LibriCSS evaluation set into non-overlapping tracks, and achieve a lower word error rate than a baseline mask-based beamformer.
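As a rough illustration of the stabilized signal-to-noise ratio loss mentioned above, the sketch below adds a soft-threshold term to the denominator of a standard negative-SNR objective so the loss saturates, rather than diverging, as the estimate approaches the target. The function name and the `tau_db` constant are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def stabilized_snr_loss(estimate, target, tau_db=-30.0):
    """Negative SNR with a soft threshold in the denominator.

    Adding tau * ||target||^2 to the error power caps the best
    achievable SNR at roughly -tau_db, keeping gradients bounded
    once the estimate is nearly perfect. Hypothetical sketch only.
    """
    tau = 10.0 ** (tau_db / 10.0)
    signal_power = np.sum(target ** 2)
    error_power = np.sum((target - estimate) ** 2)
    snr = 10.0 * np.log10(signal_power / (error_power + tau * signal_power))
    return -snr
```

With `tau_db=-30.0`, a perfect estimate yields a loss of -30 dB instead of negative infinity, which is the practical point of the stabilization.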
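The covariance factorization described in the abstract can be pictured as Φ(t, f) ≈ λ(t, f) · Φ_spatial(f): a time-varying scalar power multiplying a time-invariant spatial covariance per frequency. A minimal sketch follows, assuming a mask-weighted multichannel STFT as input; the shapes, normalization, and function name are illustrative choices, not the paper's implementation.

```python
import numpy as np

def factorized_covariance(masked_stft):
    """Split the covariance into a time-varying power and a
    time-invariant spatial component.

    masked_stft: complex array (channels, frames, freqs), e.g. the
    multichannel mixture STFT weighted by a network-predicted mask.
    Returns lambda_tf (frames, freqs) and phi_spatial
    (freqs, channels, channels). Hypothetical sketch.
    """
    C, T, F = masked_stft.shape
    # Time-varying amplitude: per-frame power averaged over channels.
    lambda_tf = np.mean(np.abs(masked_stft) ** 2, axis=0)  # (T, F)
    # Time-invariant spatial part: average of power-normalized
    # outer products over frames.
    x = np.transpose(masked_stft, (2, 1, 0))  # (F, T, C)
    outer = np.einsum('ftc,ftd->ftcd', x, np.conj(x))  # (F, T, C, C)
    norm = lambda_tf.T[..., None, None] + 1e-8  # (F, T, 1, 1)
    phi_spatial = np.mean(outer / norm, axis=1)  # (F, C, C)
    return lambda_tf, phi_spatial
```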
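Multi-frame beamforming can be realized by treating a handful of past STFT frames as extra virtual channels before computing covariances and beamforming weights, so the filter exploits temporal as well as spatial structure. The padding and stacking below are one plausible way to build such an extended observation vector, not necessarily the paper's exact formulation.

```python
import numpy as np

def stack_context_frames(stft, context=2):
    """Stack each frame with its `context` past frames as virtual channels.

    stft: complex array (channels, frames, freqs). Returns an array of
    shape (channels * (context + 1), frames, freqs) on which standard
    covariance estimation and beamforming can be run unchanged.
    Hypothetical sketch; the paper's framing/padding may differ.
    """
    C, T, F = stft.shape
    # Zero-pad `context` frames of history at the start of the signal.
    padded = np.pad(stft, ((0, 0), (context, 0), (0, 0)))
    # Shifted copies: i = context is the current frame, i = 0 the oldest.
    stacked = [padded[:, i:i + T, :] for i in range(context + 1)]
    return np.concatenate(stacked, axis=0)
```

A design note: because the stacked array keeps the usual (channels, frames, freqs) layout, existing mask-based covariance and MVDR code can consume it without modification, which is what makes this formulation convenient.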
