Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation

We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, the frame-level separated spectra are sequentially grouped and assigned to different speakers by a clustering network. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves state-of-the-art results with a modest model size.
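
The first stage can be made concrete with a small sketch of the frame-level permutation-invariant loss it describes: in every time frame, the separation loss is evaluated under each possible assignment of network outputs to reference speakers, and the best assignment is chosen independently per frame. The following is a minimal NumPy sketch under assumed shapes (T frames, F frequency bins per speaker); the function name and data layout are illustrative, not the paper's implementation.

```python
import itertools
import numpy as np

def frame_level_pit_loss(estimates, targets):
    """estimates, targets: lists of C arrays, each (T, F) magnitude spectra."""
    C = len(targets)
    # Per-frame MSE for every (estimate, target) pairing: shape (C, C, T).
    pair = np.stack([
        np.stack([((est - tgt) ** 2).mean(axis=-1) for tgt in targets])
        for est in estimates
    ])
    # Total per-frame loss under each assignment of outputs to speakers: (P, T).
    perms = list(itertools.permutations(range(C)))
    per_perm = np.stack([sum(pair[i, p[i]] for i in range(C)) for p in perms])
    # Frame-level PIT: pick the best permutation independently in every frame.
    return per_perm.min(axis=0).mean()

# Toy usage: two speakers, 100 frames, 129 frequency bins.
rng = np.random.default_rng(0)
est = [rng.standard_normal((100, 129)) for _ in range(2)]
ref = [rng.standard_normal((100, 129)) for _ in range(2)]
print(frame_level_pit_loss(est, ref))
```

Because the winning permutation is chosen per frame, the outputs are well separated locally but may swap speakers from one frame to the next, which is precisely the tracking problem the second stage addresses.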

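The second stage, sequential grouping, can be pictured as clustering frame-level features so that each frame's separated outputs are routed to consistent speaker streams across the utterance. The sketch below uses plain K-means over hypothetical frame embeddings as a stand-in for the paper's learned clustering network; all names, shapes, the two-speaker setting, and the complementary-assignment simplification are assumptions for illustration.

```python
import numpy as np

def sequential_grouping(emb_a, emb_b, out_a, out_b, iters=20, seed=0):
    """emb_*: (T, D) frame embeddings; out_*: (T, F) frame-level spectra."""
    rng = np.random.default_rng(seed)
    X = np.concatenate([emb_a, emb_b])                     # (2T, D)
    centroids = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(iters):                                 # plain K-means, K = 2
        dists = np.linalg.norm(X[:, None, :] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    T = len(emb_a)
    la = labels[:T]                                        # cluster of stream A per frame
    # Route each frame's two outputs by stream A's cluster; stream B is
    # assumed to take the complementary speaker in that frame.
    spk0 = np.where((la == 0)[:, None], out_a, out_b)
    spk1 = np.where((la == 0)[:, None], out_b, out_a)
    return spk0, spk1
```

Since the proposed approach optimizes the two stages in turn, the clustering operates on features produced by the already-trained separation stage; here the embeddings are simply taken as given inputs.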