Online Self-Attentive Gated RNNs for Real-Time Speaker Separation

Deep neural networks have recently shown great success in blind source separation under both monaural and binaural settings. Although these methods produce high-quality separations, they have mainly been applied offline, where the model has access to the full input signal while separating it. In this study, we convert a non-causal state-of-the-art separation model into a causal, real-time model and evaluate its performance under both online and offline settings. We compare the proposed model to several baseline methods under anechoic, noisy, and noisy-reverberant recording conditions, exploring both monaural and binaural inputs and outputs. Our findings shed light on the relative difference between causal and non-causal models when performing separation. Our stateful implementation for online separation leads to a minor drop in performance compared to the offline model: 0.8 dB for monaural inputs and 0.3 dB for binaural inputs, while reaching a real-time factor of 0.65. Samples can be found at the following link: https://kwanum.github.io/sagrnnc-stream-results/.
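For readers unfamiliar with the real-time factor (RTF) quoted above, it is commonly defined as processing time divided by the duration of the audio processed, so a value below 1 means the system keeps up with the incoming stream. The sketch below illustrates that common definition only; the function names and the dummy processing function are hypothetical, and the exact measurement protocol used in the paper is not stated in this abstract.

```python
import time


def rtf_from_times(processing_seconds, audio_seconds):
    """Real-time factor under the common definition: compute time / audio time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(process_fn, samples, sample_rate):
    """Time a (hypothetical) separation function on a buffer and return its RTF."""
    start = time.perf_counter()
    process_fn(samples)  # stand-in for the separation model's forward pass
    elapsed = time.perf_counter() - start
    return rtf_from_times(elapsed, len(samples) / sample_rate)


if __name__ == "__main__":
    # 0.65 s of compute per 1 s of audio corresponds to an RTF of 0.65.
    print(rtf_from_times(0.65, 1.0))
    # Assumed example input: one second of silence at 16 kHz, identity "model".
    print(measure_rtf(lambda x: x, [0.0] * 16000, 16000))
```

Under this definition, the reported RTF of 0.65 would mean that roughly 0.65 seconds of computation are spent per second of input audio.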
