Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech

Streaming recognition and segmentation of multi-party conversations with overlapping speech is crucial for the next generation of voice assistant applications. In this work we address the challenges identified in previous work on the multi-turn recurrent neural network transducer (MT-RNN-T) with a novel approach, the separator-transducer-segmenter (STS), which enables tighter integration of speech separation, recognition, and segmentation in a single model. First, we propose a new segmentation modeling strategy based on start-of-turn and end-of-turn tokens that improves segmentation without degrading recognition accuracy. Second, we further improve both speech recognition and segmentation accuracy through an emission regularization method, FastEmit, and through multi-task training with speech activity information as an additional training signal. Third, we experiment with an end-of-turn emission latency penalty to improve end-point detection for each speaker turn. Finally, we establish a novel framework for segmentation analysis of multi-party conversations through emission latency metrics. With our best model, we report a 4.6% absolute improvement in turn counting accuracy and a 17% relative word error rate (WER) improvement on the LibriCSS dataset compared to previously published work.
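As a rough illustration of the turn-token segmentation strategy and the emission latency metrics described in the abstract, the Python sketch below shows how speaker turns might be serialized into a target token stream wrapped in start-of-turn and end-of-turn markers, and how an end-of-turn emission latency could be measured against the reference turn end time. All names here (the `Turn` class, the `<sot:k>`/`<eot:k>` token spellings, the helper functions) are hypothetical and not taken from the paper; this is a minimal sketch of the general idea, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: int        # speaker/channel index
    start: float        # reference turn start time (s)
    end: float          # reference turn end time (s)
    words: List[str]    # reference transcript of the turn

def serialize_turns(turns: List[Turn]) -> List[str]:
    """Wrap each turn in start-of-turn / end-of-turn tokens,
    ordered by reference start time (hypothetical serialization)."""
    tokens: List[str] = []
    for t in sorted(turns, key=lambda t: t.start):
        tokens.append(f"<sot:{t.speaker}>")
        tokens.extend(t.words)
        tokens.append(f"<eot:{t.speaker}>")
    return tokens

def eot_emission_latency(emission_time: float, turn: Turn) -> float:
    """End-of-turn emission latency: how long after the reference
    turn end the model actually emitted its <eot> token."""
    return emission_time - turn.end

# Example: two partially overlapping turns
turns = [
    Turn(speaker=0, start=0.0, end=2.1, words=["hello", "there"]),
    Turn(speaker=1, start=1.5, end=3.0, words=["hi"]),
]
print(serialize_turns(turns))
# ['<sot:0>', 'hello', 'there', '<eot:0>', '<sot:1>', 'hi', '<eot:1>']
print(eot_emission_latency(2.6, turns[0]))  # ~0.5 s latency
```

Under this framing, an emission latency penalty of the kind the abstract mentions would push the model to emit each `<eot>` token as soon as possible after the reference turn end, and the same latency quantity can serve as an evaluation metric for segmentation analysis.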

[1] Jinyu Li et al., "Streaming Multi-Talker ASR with Token-Level Serialized Output Training," in Proc. INTERSPEECH, 2022.

[2] Liang Lu et al., "Endpoint Detection for Streaming End-to-End Multi-Talker ASR," in Proc. ICASSP, 2022.

[3] Xianrui Zheng et al., "Multi-Turn RNN-T for Streaming Recognition of Multi-Party Speech," in Proc. ICASSP, 2022.

[4] Quan Wang et al., "Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection," in Proc. ICASSP, 2022.

[5] Yashesh Gaur et al., "Continuous Streaming Multi-Talker ASR with Dual-Path Transducers," in Proc. ICASSP, 2022.

[6] Kyu J. Han et al., "A Review of Speaker Diarization: Recent Advances with Deep Learning," Computer Speech & Language, 2021.

[7] Xiong Xiao et al., "A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio," in Proc. ASRU, 2021.

[8] Naoyuki Kanda et al., "End-to-End Speaker-Attributed ASR with Transformer," in Proc. INTERSPEECH, 2021.

[9] Naoyuki Kanda et al., "Streaming Multi-Talker Speech Recognition with Joint Speaker Identification," in Proc. INTERSPEECH, 2021.

[10] Naoyuki Kanda et al., "Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone," in Proc. INTERSPEECH, 2021.

[11] Jinyu Li et al., "Streaming End-to-End Multi-Talker Speech Recognition," IEEE Signal Processing Letters, 2020.

[12] Yulan Liu et al., "Streaming Multi-Speaker ASR with RNN-T," in Proc. ICASSP, 2021.

[13] Tara N. Sainath et al., "FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization," in Proc. ICASSP, 2021.

[14] Xiaofei Wang et al., "Investigation of End-to-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings," in Proc. SLT, 2021.

[15] Naoyuki Kanda et al., "Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers," in Proc. INTERSPEECH, 2020.

[16] Tomohiro Nakatani et al., "Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source Counting, Separation and ASR," in Proc. INTERSPEECH, 2020.

[17] Han Lu et al., "End-to-End Multi-Talker Overlapping Speech Recognition," in Proc. ICASSP, 2020.

[18] Tara N. Sainath et al., "Towards Fast and Accurate Streaming End-to-End ASR," in Proc. ICASSP, 2020.

[19] Xiaofei Wang et al., "Serialized Output Training for End-to-End Overlapped Speech Recognition," in Proc. INTERSPEECH, 2020.

[20] Jinyu Li et al., "Continuous Speech Separation: Dataset and Analysis," in Proc. ICASSP, 2020.

[21] Thilo von Neumann et al., "End-to-End Training of Time Domain Audio Separation and Recognition," in Proc. ICASSP, 2020.

[22] Quoc V. Le et al., "SpecAugment on Large Scale Datasets," in Proc. ICASSP, 2020.

[23] Hermann Ney et al., "Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech," in Proc. INTERSPEECH, 2019.

[24] Quoc V. Le et al., "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in Proc. INTERSPEECH, 2019.

[25] Shinji Watanabe et al., "End-to-End Monaural Multi-Speaker ASR System without Pretraining," in Proc. ICASSP, 2019.

[26] Jonathan Le Roux et al., "End-to-End Multi-Speaker Speech Recognition," in Proc. ICASSP, 2018.

[27] Taku Kudo et al., "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing," in Proc. EMNLP, 2018.

[28] Jonathan Le Roux et al., "A Purely End-to-End System for Multi-Speaker Speech Recognition," in Proc. ACL, 2018.

[29] Jon Barker et al., "The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines," in Proc. INTERSPEECH, 2018.

[30] Dong Yu et al., "Single-Channel Multi-Talker Speech Recognition with Permutation Invariant Training," Speech Communication, 2017.

[31] Dong Yu et al., "Recognizing Multi-Talker Speech with Permutation Invariant Training," in Proc. INTERSPEECH, 2017.

[32] Jesper Jensen et al., "Permutation Invariant Training of Deep Models for Speaker-Independent Multi-Talker Speech Separation," in Proc. ICASSP, 2017.

[33] Thomas Hain et al., "The Sheffield Wargame Corpus - Day Two and Day Three," in Proc. INTERSPEECH, 2016.

[34] Geoffrey E. Hinton et al., "Layer Normalization," arXiv preprint, 2016.

[35] Jonathan Le Roux et al., "Single-Channel Multi-Speaker Separation Using Deep Clustering," in Proc. INTERSPEECH, 2016.

[36] Jimmy Ba et al., "Adam: A Method for Stochastic Optimization," in Proc. ICLR, 2015.

[37] Thomas Hain et al., "The Sheffield Wargames Corpus," in Proc. INTERSPEECH, 2013.

[38] Alex Graves, "Sequence Transduction with Recurrent Neural Networks," arXiv preprint, 2012.

[39] Lukáš Burget et al., "The AMI System for the Transcription of Speech in Meetings," in Proc. ICASSP, 2007.