论文信息 - Multi-Decoder Dprnn: Source Separation for Variable Number of Speakers

Multi-Decoder Dprnn: Source Separation for Variable Number of Speakers

We propose an end-to-end trainable approach to single-channel speech separation with unknown number of speakers. Our approach extends the MulCat source separation backbone with additional output heads: a count-head to infer the number of speakers, and decoder-heads for reconstructing the original signals. Beyond the model, we also propose a metric on how to evaluate source separation with variable number of speakers. Specifically, we clear up the issue on how to evaluate the quality when the ground-truth has more or less speakers than the ones predicted by the model. We evaluate our approach on the WSJ0-mix datasets, with mixtures up to five speakers. We demonstrate that our approach outperforms state-of-the-art in counting the number of speakers and remains competitive in quality of reconstructed signals.

Mark Hasegawa-Johnson | Junzhe Zhu | Raymond A. Yeh | M. Hasegawa-Johnson | Junzhe Zhu

[1] Rémi Gribonval,et al. Performance measurement in blind audio source separation , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[2] Paris Smaragdis,et al. Singing-voice separation from monaural recordings using robust principal component analysis , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3] Paris Smaragdis,et al. Deep learning for monaural speech separation , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4] Daniel P. W. Ellis,et al. Beta Process Sparse Nonnegative Matrix Factorization for Music , 2013, ISMIR.

[5] Naoya Takahashi,et al. Recursive speech separation for unknown number of speakers , 2019, INTERSPEECH.

[6] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[7] Yiming Xiao,et al. Improved Source Counting and Separation for Monaural Mixture , 2020, 2004.00175.

[8] Takuya Yoshioka,et al. Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] Yossi Adi,et al. Voice Separation with an Unknown Number of Multiple Speakers , 2020, ICML.

[10] Zhuo Chen,et al. Deep clustering: Discriminative embeddings for segmentation and separation , 2015, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11] Nima Mesgarani,et al. TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] DeLiang Wang,et al. Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[13] José J. López,et al. Automatic speech recognition in cocktail-party situations: a specific training for separated speech. , 2012, The Journal of the Acoustical Society of America.

[14] Dong Yu,et al. Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.