Probabilistic embeddings for speaker diarization

Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization. We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, a diagonal precision matrix, which provides a path for propagating information about the quality of the speech segment into a PLDA scoring backend. These precisions quantify the uncertainty about what the values of the embeddings would have been had they been extracted from high-quality speech segments. The proposed probabilistic embeddings (x-vectors with precisions) are interfaced with the PLDA model by treating the x-vectors as hidden variables and marginalizing them out. We use the probabilistic embeddings as input to an agglomerative hierarchical clustering (AHC) algorithm to perform diarization on the DIHARD'19 evaluation set. For each clustering hypothesis considered by AHC, we compute the full PLDA likelihood 'by the book'. The PLDA parameters and the probabilistic x-vector extractor are trained jointly and discriminatively. We demonstrate accuracy gains relative to a baseline AHC algorithm that is applied to traditional x-vectors (without uncertainty) and that uses averaging of binary log-likelihood ratios rather than by-the-book scoring.
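The marginalization and by-the-book scoring described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on simplifying assumptions that are not stated in the abstract: every covariance is diagonal, the speaker variable has a standard normal prior, the within-speaker precision `w` is a known vector, and no prior over partitions is used. All function names (`effective_precision`, `cluster_loglik`, `ahc`) are hypothetical. Under these assumptions, marginalizing out the hidden x-vector combines the segment precision `b` and the within-speaker precision `w` into an effective precision `b*w/(b+w)`, so an unreliable segment (small `b`) contributes less to the cluster likelihood.

```python
import numpy as np

def effective_precision(b, w):
    """Precision of the observed embedding mean given the speaker variable,
    after marginalizing out the hidden x-vector (variances 1/b and 1/w add)."""
    return b * w / (b + w)

def cluster_loglik(m, p):
    """Log-likelihood that all rows of m (n x d) share one speaker.

    Assumed toy model, per dimension: z ~ N(0, 1), m_i | z ~ N(z, 1/p_i).
    The speaker variable z is integrated out in closed form.
    """
    P = p.sum(axis=0)            # total precision per dimension
    a = (p * m).sum(axis=0)      # precision-weighted sum of embeddings
    per_segment = 0.5 * (np.log(p) - np.log(2 * np.pi) - p * m**2).sum()
    speaker = 0.5 * (a**2 / (1.0 + P) - np.log1p(P)).sum()
    return per_segment + speaker

def ahc(m, b, w, threshold=0.0):
    """Greedy AHC: merge the pair of clusters with the largest gain in
    total log-likelihood; stop when no merge gains more than threshold."""
    p = effective_precision(b, w)
    clusters = [[i] for i in range(len(m))]

    def ll(idx):
        return cluster_loglik(m[idx], p[idx])

    while len(clusters) > 1:
        best_gain, best_pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                gain = ll(merged) - ll(clusters[i]) - ll(clusters[j])
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_gain <= threshold:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = np.empty(len(m), dtype=int)
    for k, idx in enumerate(clusters):
        labels[idx] = k
    return labels

# Toy data: four 2-dimensional segment embeddings from two speakers.
m = np.array([[5.0, 5.0], [5.1, 4.9], [-5.0, -5.0], [-4.9, -5.1]])
b = np.full_like(m, 100.0)       # per-segment diagonal precisions
w = np.array([10.0, 10.0])       # assumed within-speaker precision
labels = ahc(m, b, w)
```

Because each merge is scored by the change in the likelihood of the whole partition, rather than by an average of pairwise scores, the precisions of all segments in both clusters enter every merge decision.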
