Real-Time Binaural Speech Separation with Preserved Spatial Cues

Deep learning speech separation algorithms have achieved great success in improving the quality and intelligibility of speech separated from mixed audio. Most previous methods focused on generating a single-channel output for each target speaker, thereby discarding the spatial cues needed to localize sound sources in space. However, preserving spatial information is important in many applications that aim to render the acoustic scene accurately, such as hearing aids and augmented reality (AR). Here, we propose a speech separation algorithm that preserves the interaural cues of the separated sound sources and can be implemented with low latency and high fidelity, thereby enabling real-time modification of the acoustic scene. Building on the time-domain audio separation network (TasNet), a single-channel time-domain separation system that can run in real time, we propose a multi-input multi-output (MIMO) end-to-end extension of TasNet that takes binaural mixed audio as input and simultaneously separates the target speakers in both channels. Experimental results show that the proposed end-to-end MIMO system significantly improves separation performance while keeping the perceived locations of the modified sources intact in various acoustic scenes.
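To make concrete what "preserving interaural cues" means, the sketch below estimates the two classic binaural cues for a pair of left/right signals: the interaural time difference (ITD, via the lag that maximizes the broadband cross-correlation) and the interaural level difference (ILD, via the channel energy ratio in dB). This is a simplified illustration, not the paper's evaluation procedure; in practice these cues are typically measured per frequency band, and the function name and layout here are my own.

```python
import numpy as np

def interaural_cues(left, right, sr):
    """Estimate broadband ITD (seconds) and ILD (dB) of a binaural pair.

    Illustrative only: ITD is the lag maximizing the full cross-correlation
    between channels (positive lag means the left channel arrives later),
    and ILD is the left/right energy ratio in decibels.
    """
    n = len(left)
    # Full cross-correlation covers lags from -(n-1) to +(n-1) samples.
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (n - 1)   # lag in samples
    itd = lag / sr                          # lag in seconds

    eps = 1e-12                             # guard against log(0)
    ild = 10.0 * np.log10((np.sum(left ** 2) + eps) /
                          (np.sum(right ** 2) + eps))
    return itd, ild
```

A separation system that preserves spatial cues should leave the ITD and ILD of each separated source close to those of the source's clean binaural rendering; a single-channel separator, by contrast, produces outputs for which these quantities are undefined.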
