DIVE: End-to-End Speech Diarization via Iterative Speaker Embedding

We introduce DIVE, an end-to-end speaker diarization system. DIVE presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting their voice activity conditioned on the extracted representations. This strategy intrinsically resolves the speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with prior work, our model does not rely on pretrained speaker representations and jointly optimizes all parameters of the system with a multi-speaker voice activity loss. DIVE does not require the training speaker identities and allows efficient window-based training. Importantly, our loss explicitly excludes unreliable speaker turn boundaries from training, which matches the standard collar-based Diarization Error Rate (DER) evaluation. Overall, these contributions yield a system redefining the state-of-the-art on the CALLHOME benchmark, with 6.7% DER compared to 7.8% for the best alternative.
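The iterative process described above can be sketched at a high level: at each step, a new speaker embedding is extracted (conditioned on the speakers found so far), and a voice activity head then scores every frame against every extracted speaker. The sketch below is a toy NumPy illustration of that control flow only, not the paper's architecture; the seeding heuristics and the sigmoid similarity score stand in for DIVE's learned selection and voice activity networks.

```python
import numpy as np


def iterative_speaker_embedding(frames, num_speakers):
    """Toy sketch of a DIVE-style iterative loop (illustrative, not the paper's model).

    frames: (T, D) array of per-frame embeddings.
    Returns (num_speakers, D) speaker embeddings and a (num_speakers, T)
    matrix of per-frame activity scores in [0, 1].
    """
    speakers = []
    for _ in range(num_speakers):
        if not speakers:
            # First iteration: seed from the highest-energy frame (a crude
            # stand-in for a learned first-speaker selection network).
            seed = frames[np.argmax(np.linalg.norm(frames, axis=1))]
        else:
            # Later iterations: seed from the frame least similar to the
            # speakers already extracted, so each pass targets a new voice.
            # This is what makes the speaker ordering unambiguous without
            # a permutation invariant training loss.
            sim = np.stack([frames @ s for s in speakers]).max(axis=0)
            seed = frames[np.argmin(sim)]
        speakers.append(seed)
    speakers = np.stack(speakers)
    # Voice activity conditioned on the extracted embeddings: a sigmoid over
    # frame/speaker similarity, standing in for the multi-speaker VAD head.
    activity = 1.0 / (1.0 + np.exp(-(speakers @ frames.T)))
    return speakers, activity
```

Because each speaker is extracted in a fixed, data-driven order, the per-speaker activity targets need no permutation matching at training time, which is the key simplification the abstract claims over permutation-invariant objectives.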
