Streaming Multi-Talker ASR with Token-Level Serialized Output Training

This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token indicating a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared with prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in our experiments on the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, surpassing prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
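The serialization idea described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: tokens from up to two overlapping speakers, each tagged with a virtual channel index and an emission time, are merged in chronological order, and a special channel-change token (written here as `<cc>`; the name is an assumption) is emitted whenever the virtual output channel switches.

```python
# Hypothetical sketch of t-SOT serialization. Each input token is a
# (emission_time, channel, token) triple; tokens are merged in emission-time
# order, and the special <cc> token marks each switch of virtual channel.
CC = "<cc>"  # assumed name for the channel-change token

def serialize_t_sot(tokens):
    """Serialize multi-talker tokens into a single chronological stream."""
    output = []
    prev_channel = None
    for _time, channel, token in sorted(tokens):
        # Emit the channel-change token whenever the virtual channel flips.
        if prev_channel is not None and channel != prev_channel:
            output.append(CC)
        output.append(token)
        prev_channel = channel
    return output

# Two partially overlapping utterances:
#   speaker A (channel 0): "how are you"  emitted at t = 0.0, 0.4, 0.8
#   speaker B (channel 1): "good morning" emitted at t = 0.5, 1.0
tokens = [
    (0.0, 0, "how"), (0.4, 0, "are"), (0.8, 0, "you"),
    (0.5, 1, "good"), (1.0, 1, "morning"),
]
print(serialize_t_sot(tokens))
# → ['how', 'are', '<cc>', 'good', '<cc>', 'you', '<cc>', 'morning']
```

The single serialized stream is what allows a standard one-branch streaming model (e.g., a transformer transducer) to be trained on overlapped speech; a downstream consumer can recover the two virtual channels by toggling on each `<cc>` token.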

[1]  Jonathan Le Roux,et al.  Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR , 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Liang Lu,et al.  Endpoint Detection for Streaming End-to-End Multi-Talker ASR , 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Xianrui Zheng,et al.  Multi-Turn RNN-T for Streaming Recognition of Multi-Party Speech , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[4]  Yashesh Gaur,et al.  Continuous Streaming Multi-Talker ASR with Dual-Path Transducers , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[5]  Xiong Xiao,et al.  A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio , 2021, 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[6]  Naoyuki Kanda,et al.  Investigation of Practical Aspects of Single Channel Speech Separation for ASR , 2021, Interspeech.

[7]  Naoyuki Kanda,et al.  End-to-End Speaker-Attributed ASR with Transformer , 2021, Interspeech.

[8]  Naoyuki Kanda,et al.  Streaming Multi-talker Speech Recognition with Joint Speaker Identification , 2021, Interspeech.

[9]  Naoyuki Kanda,et al.  Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone , 2021, Interspeech.

[10]  Jinyu Li,et al.  Streaming End-to-End Multi-Talker Speech Recognition , 2020, IEEE Signal Processing Letters.

[11]  Yulan Liu,et al.  Streaming Multi-Speaker ASR with RNN-T , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Naoyuki Kanda,et al.  Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis , 2020, 2021 IEEE Spoken Language Technology Workshop (SLT).

[13]  Yu Wu,et al.  Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Jinyu Li,et al.  Continuous Speech Separation with Conformer , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Naoyuki Kanda,et al.  Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers , 2020, INTERSPEECH.

[16]  Han Lu,et al.  End-To-End Multi-Talker Overlapping Speech Recognition , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Jon Barker,et al.  CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings , 2020, 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020).

[18]  Xiaofei Wang,et al.  Serialized Output Training for End-to-End Overlapped Speech Recognition , 2020, INTERSPEECH.

[19]  Qian Zhang,et al.  Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[20]  Jinyu Li,et al.  Continuous Speech Separation: Dataset and Analysis , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21]  Quoc V. Le,et al.  Specaugment on Large Scale Datasets , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Jinyu Li,et al.  Semantic Mask for Transformer based End-to-End Speech Recognition , 2019, INTERSPEECH.

[23]  Takuya Yoshioka,et al.  Advances in Online Audio-Visual Meeting Transcription , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[24]  Jonathan Le Roux,et al.  MIMO-Speech: End-to-End Multi-Channel Multi-Speaker Speech Recognition , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[25]  Hagen Soltau,et al.  Joint Speech Recognition and Speaker Diarization via Sequence Transduction , 2019, INTERSPEECH.

[26]  Naoyuki Kanda,et al.  Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR , 2019, INTERSPEECH.

[27]  Shinji Watanabe,et al.  End-to-end Monaural Multi-speaker ASR System without Pretraining , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[28]  Xiong Xiao,et al.  Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks , 2018, INTERSPEECH.

[29]  Jonathan Le Roux,et al.  A Purely End-to-End System for Multi-speaker Speech Recognition , 2018, ACL.

[30]  Tatsuya Kawahara,et al.  An End-to-End Approach to Joint Social Signal Detection and Automatic Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[31]  Jon Barker,et al.  The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines , 2018, INTERSPEECH.

[32]  John R. Hershey,et al.  Language independent end-to-end architecture for joint language identification and speech recognition , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[33]  Morgan Sonderegger,et al.  Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi , 2017, INTERSPEECH.

[34]  Dong Yu,et al.  Recognizing Multi-talker Speech with Permutation Invariant Training , 2017, INTERSPEECH.

[35]  Jesper Jensen,et al.  Permutation invariant training of deep models for speaker-independent multi-talker speech separation , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[36]  Jonathan Le Roux,et al.  Single-Channel Multi-Speaker Separation Using Deep Clustering , 2016, INTERSPEECH.

[37]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[38]  Sanjeev Khudanpur,et al.  Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39]  Sanjeev Khudanpur,et al.  Audio augmentation for speech recognition , 2015, INTERSPEECH.

[40]  Yoshua Bengio,et al.  End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results , 2014, ArXiv.

[41]  Alex Graves,et al.  Sequence Transduction with Recurrent Neural Networks , 2012, ArXiv.

[42]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[43]  Jonathan G. Fiscus,et al.  Multiple Dimension Levenshtein Edit Distance Calculations for Evaluating Automatic Speech Recognition Systems During Simultaneous Speech , 2006, LREC.

[44]  Elizabeth Shriberg,et al.  Analysis of overlaps in meetings by dialog factors, hot spots, speakers, and collection site: insights for automatic speech recognition , 2006, INTERSPEECH.

[45]  Jean Carletta,et al.  The AMI Meeting Corpus: A Pre-announcement , 2005, MLMI.

[46]  Andreas Stolcke,et al.  Observations on overlap: findings and implications for automatic processing of multi-party conversation , 2001, INTERSPEECH.