Estimation of the Number of Speakers with Variational Bayesian PLDA in the DIHARD Diarization Challenge

This paper focuses on estimating the number of speakers for diarization in the context of the DIHARD Challenge at Interspeech 2018. The evaluation aims to improve diarization on challenging corpora (YouTube videos, meetings, court recordings, etc.) that contain an unknown number of speakers who contribute very different amounts of speech. Our proposal for the challenge is a system based on the i-vector PLDA paradigm: given an initial segmentation of the input audio, we extract an i-vector representation for each acoustic fragment. These i-vectors are clustered with a Fully Bayesian PLDA. This generative model treats the speaker labels as latent variables and produces the diarization labels by means of Variational Bayes iterations. The number of speakers is decided by comparing multiple hypotheses according to different information criteria, all of which are built around the Evidence Lower Bound (ELBO) provided by our PLDA.
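The hypothesis comparison described above can be illustrated with a minimal sketch. It assumes that each candidate number of speakers has already been run to convergence and has yielded an ELBO value. The BIC-style penalty, the function names, the `params_per_speaker` argument, and the toy ELBO values are all assumptions made for illustration; they are not the criteria or results from the paper.

```python
import math


def penalized_elbo(elbo, n_speakers, n_segments, params_per_speaker):
    """Score one hypothesis: ELBO minus a BIC-style complexity penalty.

    The penalty form is an assumption for illustration only.
    """
    n_free = n_speakers * params_per_speaker
    return elbo - 0.5 * n_free * math.log(n_segments)


def select_num_speakers(elbo_by_k, n_segments, params_per_speaker):
    """Return the speaker count with the highest penalized score, plus all scores."""
    scores = {
        k: penalized_elbo(elbo, k, n_segments, params_per_speaker)
        for k, elbo in elbo_by_k.items()
    }
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical ELBO values for 1 to 4 speakers (made-up numbers).
elbo_by_k = {1: -1200.0, 2: -1050.0, 3: -1040.0, 4: -1038.0}

best_k, scores = select_num_speakers(
    elbo_by_k, n_segments=100, params_per_speaker=10
)
raw_best_k = max(elbo_by_k, key=elbo_by_k.get)

print("best by raw ELBO:", raw_best_k)
print("best by penalized ELBO:", best_k)
```

In this toy setting the raw ELBO keeps increasing slightly as speakers are added, while the penalized score favours a smaller hypothesis. This is one plausible reason to wrap the ELBO in an information criterion rather than use it directly.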
