Multi-talker ASR for an unknown number of sources: Joint training of source counting, separation and ASR

Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources, and we combine it with a single-talker speech recognizer to form the first end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers. Our experiments show promising performance in source counting accuracy, source separation, and speech recognition on simulated clean mixtures from WSJ0-2mix and WSJ0-3mix. In particular, we set a new state-of-the-art word error rate on the WSJ0-2mix database. Furthermore, our system generalizes well to a larger number of speakers than it ever saw during training, as shown in experiments with the WSJ0-4mix database.
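To make the iterate-and-count idea concrete, the following is a minimal PyTorch sketch of such a loop: an extractor repeatedly peels one source off the residual mixture, and a stop detector decides whether any active speaker remains, so the number of iterations becomes the source-count estimate. All module names (SpeechExtractor, StopDetector) and the energy-based stop feature are hypothetical placeholders for illustration, not the authors' actual architecture.

```python
# Sketch of iterative source extraction with source counting.
# The networks below are toy stand-ins; in the paper, the extractor and
# the stop criterion are trained jointly with a single-talker ASR model.

import torch
import torch.nn as nn


class SpeechExtractor(nn.Module):
    """Extracts one speaker per iteration and returns the residual mixture."""

    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 2, kernel_size=3, padding=1)  # toy stand-in

    def forward(self, mixture):
        out = self.net(mixture.unsqueeze(1))  # (batch, 2, time)
        source, residual = out[:, 0], out[:, 1]
        return source, residual


class StopDetector(nn.Module):
    """Predicts whether another active speaker remains in the residual."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 1)

    def forward(self, residual):
        # Hypothetical feature: frame-averaged energy of the residual.
        energy = residual.pow(2).mean(dim=-1, keepdim=True)
        return torch.sigmoid(self.fc(energy))  # P(another speaker remains)


def separate_and_count(mixture, extractor, stop_detector,
                       max_iters=5, threshold=0.5):
    """Iteratively peel off sources until the stop detector fires.

    The number of extracted sources is the source-count estimate."""
    sources, residual = [], mixture
    for _ in range(max_iters):
        source, residual = extractor(residual)
        sources.append(source)
        if stop_detector(residual).item() < threshold:
            break  # no further speaker detected -> counting done
    return sources  # len(sources) == estimated number of speakers


if __name__ == "__main__":
    mixture = torch.randn(1, 16000)  # 1 s of toy audio at 16 kHz
    sources = separate_and_count(mixture, SpeechExtractor(), StopDetector())
    print(f"Estimated number of speakers: {len(sources)}")
    # Each extracted source would then be fed to a single-talker ASR model.
```

A practical consequence of this design is that the model is not tied to a fixed output dimensionality: running the loop one more time handles one more speaker, which is what allows generalization to mixtures with more speakers than seen during training.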
