Deep CASA for Talker-Independent Monaural Speech Separation

Monaural speech separation is the task of separating target speech from interference in single-channel recordings. Although substantial progress has recently been made in deep learning based speech separation, previous studies usually focus on a single type of interference: either background noise or competing speakers. In this study, we address both speech and nonspeech interference, i.e., monaural speaker separation in noise, in a talker-independent fashion. We extend a recently proposed deep CASA system to deal with noisy speaker mixtures. To facilitate speech enhancement, a denoising module is added to deep CASA as a front-end processor. The proposed systems achieve state-of-the-art results on a benchmark noisy two-speaker separation dataset. The denoising module yields substantial performance gains across various noise types, and even improves generalization in noise-free conditions.
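Talker-independent training, as used in deep CASA and related systems, hinges on resolving the permutation ambiguity between network outputs and reference speakers. The core criterion can be sketched as follows; this is an illustrative utterance-level permutation-invariant training (PIT) loss in NumPy, not the paper's frame-level deep CASA implementation, and `pit_mse` is a hypothetical helper name:

```python
import itertools
import numpy as np

def pit_mse(estimates, targets):
    """Utterance-level permutation-invariant MSE.

    estimates, targets: arrays of shape (num_speakers, num_samples).
    Evaluates the mean-squared error under every assignment of
    estimated outputs to reference speakers and returns the minimum,
    along with the permutation that achieves it.
    """
    n = len(targets)
    best_loss, best_perm = np.inf, None
    for perm in itertools.permutations(range(n)):
        # Reorder the estimates according to this candidate assignment.
        loss = np.mean((estimates[list(perm)] - targets) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

Because the loss is minimized over all output-to-speaker assignments, the network is free to emit the speakers in any order, which is what makes training talker-independent; the factorial cost over permutations is negligible for the two-speaker case considered here.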
