Which Ones Are Speaking? Speaker-Inferred Model for Multi-Talker Speech Separation

Recent deep learning methods have achieved notable success on the multi-talker mixed-speech separation task, also known as the Cocktail Party Problem. However, most existing models are designed for predefined conditions, which leaves them unable to automatically handle complex auditory scenes, such as a variable and unknown number of speakers in the mixture. In this paper, we propose a speaker-inferred model, built on a flexible and efficient Seq2Seq generation model, that accurately infers the speakers present and the speech channel of each. Our model is fully end-to-end, with several modules that emphasize and better exploit speaker information. Without a priori knowledge of the number of speakers, additional curriculum training strategies, or hand-crafted rules, our method achieves performance comparable to strong baselines.
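Since the abstract describes the architecture only at a high level, the following is a minimal sketch of how a Seq2Seq separator that infers its own speaker count could look, assuming a PyTorch implementation. The class name SpeakerInferredSeparator, the pooled-context decoder, the layer sizes, and the stop-flag mechanism are all illustrative assumptions, not the authors' actual design.

```python
# A minimal sketch, in PyTorch, of the idea described above: a Seq2Seq
# decoder generates one separated channel per inferred speaker and decides
# on its own when to stop, so the number of speakers need not be fixed in
# advance. All names, layer sizes, and the stop mechanism are illustrative
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class SpeakerInferredSeparator(nn.Module):
    def __init__(self, n_bins=129, hidden=256, max_speakers=5):
        super().__init__()
        self.hidden = hidden
        self.max_speakers = max_speakers
        # BiLSTM encoder over the mixture spectrogram
        self.encoder = nn.LSTM(n_bins, hidden, batch_first=True,
                               bidirectional=True)
        # decoder cell: one generation step per hypothesized speaker
        self.decoder_cell = nn.LSTMCell(2 * hidden, hidden)
        self.mask_head = nn.Linear(hidden, n_bins)  # per-speaker T-F mask
        self.stop_head = nn.Linear(hidden, 1)       # "no more speakers" flag

    def forward(self, mix_spec):
        # mix_spec: (batch, frames, n_bins) magnitude spectrogram
        enc, _ = self.encoder(mix_spec)              # (B, T, 2H)
        context = enc.mean(dim=1)                    # pooled mixture summary
        h = mix_spec.new_zeros(mix_spec.size(0), self.hidden)
        c = torch.zeros_like(h)
        channels, stops = [], []
        for _ in range(self.max_speakers):
            h, c = self.decoder_cell(context, (h, c))
            # the decoder state scores a mask over frequency bins;
            # broadcasting it over frames keeps the sketch simple (a real
            # model would attend over time instead)
            mask = torch.sigmoid(self.mask_head(h)).unsqueeze(1)  # (B, 1, F)
            channels.append(mask * mix_spec)         # masked channel estimate
            stops.append(torch.sigmoid(self.stop_head(h)))
        # (B, S, T, F) channel estimates and (B, S) stop probabilities;
        # at inference, channels after the first stop > 0.5 are discarded
        return torch.stack(channels, dim=1), torch.cat(stops, dim=1)
```

Training such a sketch would presumably still need a permutation-invariant loss over the generated channels and a binary loss on the stop flags; the paper's specific modules for emphasizing speaker information are not reproduced here.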
