Overlap-Aware Low-Latency Online Speaker Diarization Based on End-to-End Local Segmentation

We propose to address online speaker diarization as a combination of incremental clustering and local diarization applied to a rolling buffer updated every 500ms. Every single step of the proposed pipeline is designed to take full advantage of the strong ability of a recently proposed end-to-end overlapaware segmentation to detect and separate overlapping speakers. In particular, we propose a modified version of the statistics pooling layer (initially introduced in the x-vector architecture) to give less weight to frames where the segmentation model predicts simultaneous speakers. Furthermore, we derive cannot-link constraints from the initial segmentation step to prevent two local speakers from being wrongfully merged during the incremental clustering step. Finally, we show how the latency of the proposed approach can be adjusted between 500ms and 5s to match the requirements of a particular use case, and we provide a systematic analysis of the influence of latency on the overall performance (on AMI, DIHARD and VoxConverse).

[1]  Marc Delcroix,et al.  Integrating End-to-End Neural and Clustering-Based Diarization: Getting the Best of Both Worlds , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2]  Luk'avs Burget,et al.  Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks , 2020, Comput. Speech Lang..

[3]  Antoine Laurent,et al.  End-to-end speaker segmentation for overlap-aware resegmentation , 2021, Interspeech.

[4]  Joon Son Chung,et al.  Spot the conversation: speaker diarisation in the wild , 2020, INTERSPEECH.

[5]  Petr Motlícek,et al.  Integrating online i-vector extractor with information bottleneck based speaker diarization system , 2015, INTERSPEECH.

[6]  Joon Son Chung,et al.  VoxCeleb2: Deep Speaker Recognition , 2018, INTERSPEECH.

[7]  Ming Li,et al.  LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization , 2019, INTERSPEECH.

[8]  Yoshua Bengio,et al.  Algorithms for Hyper-Parameter Optimization , 2011, NIPS.

[9]  Hervé Bredin,et al.  pyannote.metrics: A Toolkit for Reproducible Evaluation, Diagnostic, and Error Analysis of Speaker Diarization Systems , 2017, INTERSPEECH.

[10]  Kenneth Ward Church,et al.  The Second DIHARD Diarization Challenge: Dataset, task, and baselines , 2019, INTERSPEECH.

[11]  Kenneth Ward Church,et al.  Third DIHARD Challenge Evaluation Plan , 2020, ArXiv.

[12]  Naoyuki Kanda,et al.  End-to-End Neural Speaker Diarization with Permutation-Free Objectives , 2019, INTERSPEECH.

[13]  Shota Horiguchi,et al.  End-To-End Speaker Diarization as Post-Processing , 2020, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[14]  Shuai Wang,et al.  But System for the Second Dihard Speech Diarization Challenge , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Mari Ostendorf,et al.  Efficient use of overlap information in speaker diarization , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[16]  Yoshua Bengio,et al.  Speaker Recognition from Raw Waveform with SincNet , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[17]  Mireia Díez,et al.  BUT System for DIHARD Speech Diarization Challenge 2018 , 2018, INTERSPEECH.

[18]  Daniel Povey,et al.  MUSAN: A Music, Speech, and Noise Corpus , 2015, ArXiv.

[19]  Kenneth Ward Church,et al.  The Third DIHARD Diarization Challenge , 2020, Interspeech.

[20]  Jean Carletta,et al.  Unleashing the killer corpus: experiences in creating the multi-everything AMI Meeting Corpus , 2007, Lang. Resour. Evaluation.

[21]  Shinji Watanabe,et al.  Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers , 2021, Interspeech.

[22]  Leibny Paola García-Perera,et al.  Overlap-Aware Diarization: Resegmentation Using Neural End-to-End Overlapped Speech Detection , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  Stefanos Zafeiriou,et al.  ArcFace: Additive Angular Margin Loss for Deep Face Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Sanjeev Khudanpur,et al.  X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[25]  Josh H McDermott,et al.  Statistics of natural reverberation enable perceptual separation of sound and space , 2016, Proceedings of the National Academy of Sciences.

[26]  Nicholas W. D. Evans,et al.  Speaker Diarization: A Review of Recent Research , 2010, IEEE Transactions on Audio, Speech, and Language Processing.

[27]  Naoyuki Kanda,et al.  End-to-End Neural Speaker Diarization with Self-Attention , 2019, 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[28]  Shrikanth Narayanan,et al.  Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap , 2020, IEEE Signal Processing Letters.

[29]  Joon Son Chung,et al.  VoxCeleb: A Large-Scale Speaker Identification Dataset , 2017, INTERSPEECH.