Separator-Transducer-Segmenter: Streaming Recognition and Segmentation of Multi-party Speech

Streaming recognition and segmentation of multi-party conversations with overlapping speech is crucial for the next generation of voice assistant applications. In this work we address the challenges identified in previous work on the multi-turn recurrent neural network transducer (MT-RNN-T) with a novel approach, the separator-transducer-segmenter (STS), which enables tighter integration of speech separation, recognition, and segmentation in a single model. First, we propose a new segmentation modeling strategy based on start-of-turn and end-of-turn tokens that improves segmentation without degrading recognition accuracy. Second, we further improve both speech recognition and segmentation accuracy through an emission regularization method, FastEmit, and through multi-task training with speech activity information as an additional training signal. Third, we experiment with an end-of-turn emission latency penalty to improve end-point detection for each speaker turn. Finally, we establish a novel framework for segmentation analysis of multi-party conversations through emission latency metrics. With our best model, we report a 4.6% absolute improvement in turn counting accuracy and a 17% relative word error rate (WER) improvement on the LibriCSS dataset compared to previously published work.
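As a rough illustration of the turn-token segmentation strategy and the emission latency metrics described in the abstract, the Python sketch below shows how speaker turns might be serialized into a target token stream wrapped in start-of-turn and end-of-turn markers, and how an end-of-turn emission latency could be measured against the reference turn end time. All names here (the `Turn` class, the `<sot:k>`/`<eot:k>` token spellings, the helper functions) are hypothetical and not taken from the paper; this is a minimal sketch of the general idea, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: int        # speaker/channel index
    start: float        # reference turn start time (s)
    end: float          # reference turn end time (s)
    words: List[str]    # reference transcript of the turn

def serialize_turns(turns: List[Turn]) -> List[str]:
    """Wrap each turn in start-of-turn / end-of-turn tokens,
    ordered by reference start time (hypothetical serialization)."""
    tokens: List[str] = []
    for t in sorted(turns, key=lambda t: t.start):
        tokens.append(f"<sot:{t.speaker}>")
        tokens.extend(t.words)
        tokens.append(f"<eot:{t.speaker}>")
    return tokens

def eot_emission_latency(emission_time: float, turn: Turn) -> float:
    """End-of-turn emission latency: how long after the reference
    turn end the model actually emitted its <eot> token."""
    return emission_time - turn.end

# Example: two partially overlapping turns
turns = [
    Turn(speaker=0, start=0.0, end=2.1, words=["hello", "there"]),
    Turn(speaker=1, start=1.5, end=3.0, words=["hi"]),
]
print(serialize_turns(turns))
# ['<sot:0>', 'hello', 'there', '<eot:0>', '<sot:1>', 'hi', '<eot:1>']
print(eot_emission_latency(2.6, turns[0]))  # ~0.5 s latency
```

Under this framing, an emission latency penalty of the kind the abstract mentions would push the model to emit each `<eot>` token as soon as possible after the reference turn end, and the same latency quantity can serve as an evaluation metric for segmentation analysis.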

[1] Jinyu Li et al., "Streaming Multi-Talker ASR with Token-Level Serialized Output Training," in Proc. INTERSPEECH, 2022.

[2] Liang Lu et al., "Endpoint Detection for Streaming End-to-End Multi-Talker ASR," in Proc. ICASSP, 2022.

[3] Xianrui Zheng et al., "Multi-Turn RNN-T for Streaming Recognition of Multi-Party Speech," in Proc. ICASSP, 2022.

[4] Quan Wang et al., "Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection," in Proc. ICASSP, 2022.

[5] Yashesh Gaur et al., "Continuous Streaming Multi-Talker ASR with Dual-Path Transducers," in Proc. ICASSP, 2022.

[6] Kyu J. Han et al., "A Review of Speaker Diarization: Recent Advances with Deep Learning," Computer Speech & Language, 2021.

[7] Xiong Xiao et al., "A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio," in Proc. ASRU, 2021.

[8] Naoyuki Kanda et al., "End-to-End Speaker-Attributed ASR with Transformer," in Proc. INTERSPEECH, 2021.

[9] Naoyuki Kanda et al., "Streaming Multi-Talker Speech Recognition with Joint Speaker Identification," in Proc. INTERSPEECH, 2021.

[10] Naoyuki Kanda et al., "Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone," in Proc. INTERSPEECH, 2021.

[11] Jinyu Li et al., "Streaming End-to-End Multi-Talker Speech Recognition," IEEE Signal Processing Letters, 2020.

[12] Yulan Liu et al., "Streaming Multi-Speaker ASR with RNN-T," in Proc. ICASSP, 2021.

[13] Tara N. Sainath et al., "FastEmit: Low-Latency Streaming ASR with Sequence-Level Emission Regularization," in Proc. ICASSP, 2021.

[14] Xiaofei Wang et al., "Investigation of End-to-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings," in Proc. SLT, 2021.

[15] Naoyuki Kanda et al., "Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers," in Proc. INTERSPEECH, 2020.

[16] Tomohiro Nakatani et al., "Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source Counting, Separation and ASR," in Proc. INTERSPEECH, 2020.

[17] Han Lu et al., "End-to-End Multi-Talker Overlapping Speech Recognition," in Proc. ICASSP, 2020.

[18] Tara N. Sainath et al., "Towards Fast and Accurate Streaming End-to-End ASR," in Proc. ICASSP, 2020.

[19] Xiaofei Wang et al., "Serialized Output Training for End-to-End Overlapped Speech Recognition," in Proc. INTERSPEECH, 2020.

[20] Jinyu Li et al., "Continuous Speech Separation: Dataset and Analysis," in Proc. ICASSP, 2020.

[21] Thilo von Neumann et al., "End-to-End Training of Time Domain Audio Separation and Recognition," in Proc. ICASSP, 2020.

[22] Quoc V. Le et al., "SpecAugment on Large Scale Datasets," in Proc. ICASSP, 2020.

[23] Hermann Ney et al., "Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech," in Proc. INTERSPEECH, 2019.

[24] Quoc V. Le et al., "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition," in Proc. INTERSPEECH, 2019.

[25] Shinji Watanabe et al., "End-to-End Monaural Multi-Speaker ASR System without Pretraining," in Proc. ICASSP, 2019.

[26] Jonathan Le Roux et al., "End-to-End Multi-Speaker Speech Recognition," in Proc. ICASSP, 2018.

[27] Taku Kudo et al., "SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing," in Proc. EMNLP, 2018.

[28] Jonathan Le Roux et al., "A Purely End-to-End System for Multi-Speaker Speech Recognition," in Proc. ACL, 2018.

[29] Jon Barker et al., "The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines," in Proc. INTERSPEECH, 2018.

[30] Dong Yu et al., "Single-Channel Multi-Talker Speech Recognition with Permutation Invariant Training," Speech Communication, 2017.

[31] Dong Yu et al., "Recognizing Multi-Talker Speech with Permutation Invariant Training," in Proc. INTERSPEECH, 2017.

[32] Jesper Jensen et al., "Permutation Invariant Training of Deep Models for Speaker-Independent Multi-Talker Speech Separation," in Proc. ICASSP, 2017.

[33] Thomas Hain et al., "The Sheffield Wargame Corpus - Day Two and Day Three," in Proc. INTERSPEECH, 2016.

[34] Geoffrey E. Hinton et al., "Layer Normalization," arXiv preprint, 2016.

[35] Jonathan Le Roux et al., "Single-Channel Multi-Speaker Separation Using Deep Clustering," in Proc. INTERSPEECH, 2016.

[36] Jimmy Ba et al., "Adam: A Method for Stochastic Optimization," in Proc. ICLR, 2015.

[37] Thomas Hain et al., "The Sheffield Wargames Corpus," in Proc. INTERSPEECH, 2013.

[38] Alex Graves, "Sequence Transduction with Recurrent Neural Networks," arXiv preprint, 2012.

[39] Lukáš Burget et al., "The AMI System for the Transcription of Speech in Meetings," in Proc. ICASSP, 2007.