Which Ones Are Speaking? Speaker-Inferred Model for Multi-Talker Speech Separation
Jiaming Xu | Bo Xu | Jing Shi