Paper information - End-to-end speaker segmentation for overlap-aware resegmentation

End-to-end speaker segmentation for overlap-aware resegmentation

Speaker segmentation consists of partitioning a conversation between one or more speakers into speaker turns. The task is usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped speech detection); we propose instead to train an end-to-end segmentation model that performs it directly. Inspired by the original end-to-end neural speaker diarization approach (EEND), we model the task as a multi-label classification problem using permutation-invariant training. The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (every 16 ms). Experiments on multiple speaker diarization datasets show that our model can be used with great success for both voice activity detection and overlapped speech detection. The proposed model can also be used as a post-processing step to detect and correctly assign overlapped speech regions. The relative diarization error rate improvement over the best considered baseline (VBx) reaches 17% on AMI, 13% on DIHARD 3, and 13% on VoxConverse.
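The permutation-invariant multi-label formulation mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cap of three speakers per chunk, the toy labels, and the assumption that a 5-second chunk at a 16 ms step yields about 312 frames are all illustrative choices made here. The sketch scores every assignment of output channels to reference speakers with frame-level binary cross-entropy and keeps the cheapest one.

```python
from itertools import permutations

import numpy as np


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy over all frames and speakers."""
    pred = np.clip(pred, eps, 1.0 - eps)
    losses = target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)
    return float(-np.mean(losses))


def permutation_invariant_bce(pred, target):
    """Return the lowest loss over all speaker permutations, and that permutation.

    pred, target: arrays of shape (num_frames, num_speakers).
    Column j of the reordered prediction is pred[:, best_perm[j]].
    """
    num_speakers = target.shape[1]
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(num_speakers)):
        loss = binary_cross_entropy(pred[:, list(perm)], target)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Assumption: a 5 s chunk with one output frame every 16 ms.
num_frames = int(5.0 / 0.016)
max_speakers = 3  # hypothetical cap on speakers per chunk

# Toy reference labels: speaker 0 and speaker 1 overlap on frames 100-149,
# speaker 2 stays silent. Overlap is simply two active labels on one frame.
target = np.zeros((num_frames, max_speakers))
target[0:150, 0] = 1.0
target[100:250, 1] = 1.0

# Toy prediction: the same activity, but on shuffled output channels
# and with soft scores (0.9 for active, 0.1 for inactive).
pred = np.where(target[:, [2, 0, 1]] > 0.5, 0.9, 0.1)

best_loss, best_perm = permutation_invariant_bce(pred, target)
print(num_frames, best_perm, round(best_loss, 4))
```

Because the loss is minimized over permutations, the network is free to emit speakers on any output channel. The exhaustive search is only practical for a small number of speakers per chunk, since the number of permutations grows factorially.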

Antoine Laurent | Hervé Bredin

[1] Naoyuki Kanda, et al. End-to-End Neural Speaker Diarization with Self-Attention, 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[2] Lukáš Burget, et al. Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks, 2020, Comput. Speech Lang.

[3] Jean-Luc Gauvain, et al. Optimization of RNN-Based Speech Activity Detection, 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[4] Kenneth Ward Church, et al. Third DIHARD Challenge Evaluation Plan, 2020, ArXiv.

[5] Mari Ostendorf, et al. Efficient use of overlap information in speaker diarization, 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[6] Marie Kunešová, et al. Detection of Overlapping Speech for the Purposes of Speaker Diarization, 2019, SPECOM.

[7] Naoyuki Kanda, et al. End-to-End Neural Speaker Diarization with Permutation-Free Objectives, 2019, INTERSPEECH.

[8] Shota Horiguchi, et al. End-To-End Speaker Diarization as Post-Processing, 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9] Mireia Díez, et al. Analysis of the BUT Diarization System for VoxConverse Challenge, 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Jean Carletta, et al. Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus, 2007, Lang. Resour. Evaluation.

[11] Yoshua Bengio, et al. Speaker Recognition from Raw Waveform with SincNet, 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[12] Delphine Charlet, et al. Impact of overlapping speech detection on speaker diarization for broadcast news and debates, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13] Daniel Povey, et al. MUSAN: A Music, Speech, and Noise Corpus, 2015, ArXiv.

[14] Claude Barras, et al. Speaker Change Detection in Broadcast TV Using Bidirectional Long Short-Term Memory Networks, 2017, INTERSPEECH.

[15] Shinji Watanabe, et al. End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection, 2021, 2021 IEEE Spoken Language Technology Workshop (SLT).

[16] Leibny Paola García-Perera, et al. Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection, 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17] Joon Son Chung, et al. Spot the conversation: speaker diarisation in the wild, 2020, INTERSPEECH.

[18] Pavel Korshunov, et al. pyannote.audio: Neural Building Blocks for Speaker Diarization, 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[19] Hervé Bredin, et al. pyannote.metrics: A Toolkit for Reproducible Evaluation, Diagnostic, and Error Analysis of Speaker Diarization Systems, 2017, INTERSPEECH.