LaFurca: Iterative Multi-Stage Refined End-to-End Monaural Speech Separation Based on Context-Aware Dual-Path Deep Parallel Inter-Intra Bi-LSTM

Deep neural network with dual-path bi-directional long short-term memory (BiLSTM) block has been proved to be very effective in sequence modeling, especially in speech separation, e.g. DPRNN-TasNet \cite{luo2019dual}. In this paper, we propose several improvements of dual-path BiLSTM based network for end-to-end approach to monaural speech separation, which consists of 1) dual-path network with intra-parallel BiLSTM and inter-parallel BiLSTM components, 2) global context aware inter-intra cross-parallel BiLSTM, 3) multiple spiral iterative refinement dual-path BiLSTM, that all these networks take the mixed utterance of two speakers and map it to two separated utterances, where each utterance contains only one speaker's voice. For the objective, we propose to train the network by directly optimizing utterance level scale-invariant signal-to-distortion ratio (SI-SDR) in a permutation invariant training (PIT) style. Our experiments on the public WSJ0-2mix data corpus results in 20.55dB SDR improvement, 20.35dB SI-SDR improvement, 3.69 of PESQ, and 94.86\% of ESTOI, which shows our proposed networks can lead to performance improvement on the speaker separation task. We have open-sourced our re-implementation of the DPRNN-TasNet in this https URL, and our LaFurca is realized based on this implementation of DPRNN-TasNet, it is believed that the results in this paper can be smoothly reproduced.

[1]  Zhuo Chen,et al.  Deep clustering: Discriminative embeddings for segmentation and separation , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Nima Mesgarani,et al.  Deep attractor network for single-microphone speaker separation , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Andries P. Hekstra,et al.  Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[4]  Nima Mesgarani,et al.  TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Liu Liu,et al.  Is CQT more suitable for monaural speech separation than STFT? an empirical study , 2019, ArXiv.

[6]  Nima Mesgarani,et al.  Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation , 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[7]  Yossi Adi,et al.  Voice Separation with an Unknown Number of Multiple Speakers , 2020, ICML.

[8]  Takuya Yoshioka,et al.  Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Gang Wang,et al.  Global Context-Aware Attention LSTM Networks for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Huibin Lin,et al.  Furcax: End-to-end Monaural Speech Separation Based on Deep Gated (De)convolutional Neural Networks with Adversarial Example Training , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[12]  Guy J. Brown,et al.  Computational Auditory Scene Analysis: Principles, Algorithms, and Applications , 2006 .

[13]  Kaiming He,et al.  Group Normalization , 2018, ECCV.

[14]  Liu Liu,et al.  FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks , 2019, MMM.

[15]  Neil Zeghidour,et al.  Wavesplit: End-to-End Speech Separation by Speaker Clustering , 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[16]  DeLiang Wang,et al.  Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation , 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[17]  Huibin Lin,et al.  Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-Parallel Convolutional Modules for End-to-End Monaural Speech Separation , 2019, INTERSPEECH.

[18]  Paris Smaragdis,et al.  Convolutive Speech Bases and Their Application to Supervised Speech Separation , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Jonathan Le Roux,et al.  SDR – Half-baked or Well Done? , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Jonathan Le Roux,et al.  Single-Channel Multi-Speaker Separation Using Deep Clustering , 2016, INTERSPEECH.

[21]  Huibin Lin,et al.  End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network , 2019, INTERSPEECH.

[22]  Nima Mesgarani,et al.  Speaker-Independent Speech Separation With Deep Attractor Network , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[23]  DeLiang Wang,et al.  Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Tuomas Virtanen,et al.  Speech recognition using factorial hidden Markov models for separation in the feature space , 2006, INTERSPEECH.

[25]  Jesper Jensen,et al.  An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[26]  DeLiang Wang,et al.  An Unsupervised Approach to Cochannel Speech Separation , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[27]  Tong Zhang,et al.  Spatial–Temporal Recurrent Neural Network for Emotion Recognition , 2017, IEEE Transactions on Cybernetics.

[28]  Jonathan Le Roux,et al.  Universal Sound Separation , 2019, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).

[29]  DeLiang Wang,et al.  Model-based sequential organization in cochannel speech , 2006, IEEE Transactions on Audio, Speech, and Language Processing.