DIVE: End-to-End Speech Diarization via Iterative Speaker Embedding

We introduce DIVE, an end-to-end speaker diarization system. DIVE presents the diarization task as an iterative process: it repeatedly builds a representation for each speaker before predicting their voice activity conditioned on the extracted representations. This strategy intrinsically resolves the speaker ordering ambiguity without requiring the classical permutation invariant training loss. In contrast with prior work, our model does not rely on pretrained speaker representations and jointly optimizes all parameters of the system with a multi-speaker voice activity loss. DIVE does not require the training speaker identities and allows efficient window-based training. Importantly, our loss explicitly excludes unreliable speaker turn boundaries from training, which matches the standard collar-based Diarization Error Rate (DER) evaluation. Overall, these contributions yield a system redefining the state-of-the-art on the CALLHOME benchmark, with 6.7% DER compared to 7.8% for the best alternative.
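The iterative process described above can be sketched at a high level: at each step, a new speaker embedding is extracted (conditioned on the speakers found so far), and a voice activity head then scores every frame against every extracted speaker. The sketch below is a toy NumPy illustration of that control flow only, not the paper's architecture; the seeding heuristics and the sigmoid similarity score stand in for DIVE's learned selection and voice activity networks.

```python
import numpy as np


def iterative_speaker_embedding(frames, num_speakers):
    """Toy sketch of a DIVE-style iterative loop (illustrative, not the paper's model).

    frames: (T, D) array of per-frame embeddings.
    Returns (num_speakers, D) speaker embeddings and a (num_speakers, T)
    matrix of per-frame activity scores in [0, 1].
    """
    speakers = []
    for _ in range(num_speakers):
        if not speakers:
            # First iteration: seed from the highest-energy frame (a crude
            # stand-in for a learned first-speaker selection network).
            seed = frames[np.argmax(np.linalg.norm(frames, axis=1))]
        else:
            # Later iterations: seed from the frame least similar to the
            # speakers already extracted, so each pass targets a new voice.
            # This is what makes the speaker ordering unambiguous without
            # a permutation invariant training loss.
            sim = np.stack([frames @ s for s in speakers]).max(axis=0)
            seed = frames[np.argmin(sim)]
        speakers.append(seed)
    speakers = np.stack(speakers)
    # Voice activity conditioned on the extracted embeddings: a sigmoid over
    # frame/speaker similarity, standing in for the multi-speaker VAD head.
    activity = 1.0 / (1.0 + np.exp(-(speakers @ frames.T)))
    return speakers, activity
```

Because each speaker is extracted in a fixed, data-driven order, the per-speaker activity targets need no permutation matching at training time, which is the key simplification the abstract claims over permutation-invariant objectives.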
