on Bayesian HMM with Eigenvoice Priors

Most current speaker diarization (SD) methods address the task in two steps: segmentation of the input conversation into (preferably) speaker-homogeneous segments, followed by clustering of those segments. Generally, different models and techniques are used for the two steps. In this paper, we present an approach in which straightforward and efficient Variational Bayes (VB) inference in a single probabilistic model addresses the complete SD problem. Our model is a Bayesian Hidden Markov Model (HMM) in which states represent speaker-specific distributions and transitions between states represent speaker turns. As in the i-vector or Joint Factor Analysis (JFA) models, speaker distributions are modeled by Gaussian mixture models (GMMs) with parameters constrained by eigenvoice priors, which allows the speaker models to be estimated robustly from very short speech segments. The model, which was released as open-source code and has already been used by several labs, is fully described for the first time in this paper. We present results, and we compare and combine the system with other state-of-the-art approaches. The model provides the best results reported so far on the CALLHOME dataset.
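To make the role of the HMM concrete, the following is a minimal sketch rather than the released implementation. It covers only the idea that states are speakers and transitions are speaker turns. Several things here are assumptions made for illustration: the per-frame speaker log-likelihoods are taken as given (in the paper they would come from the eigenvoice-constrained GMMs), the transition rule with a self-loop probability `p_loop` and speaker priors `pi` is one common way to encourage speaker persistence, and Viterbi decoding is swapped in for the VB inference that the abstract describes. The names `transition_matrix`, `viterbi`, `pi`, and `p_loop` are hypothetical.

```python
import math


def transition_matrix(pi, p_loop):
    """Assumed rule: A[i][j] = (1 - p_loop) * pi[j] + p_loop * [i == j].

    A larger p_loop makes staying with the same speaker more likely.
    Each row sums to 1 when pi sums to 1.
    """
    n = len(pi)
    return [
        [(1.0 - p_loop) * pi[j] + (p_loop if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def viterbi(frame_loglik, pi, p_loop):
    """Return the most likely speaker index per frame.

    frame_loglik[t][s] is the (given) log-likelihood of frame t under
    speaker s. Requires every pi[s] > 0 and p_loop < 1 so that all
    logarithms are finite.
    """
    n = len(pi)
    A = transition_matrix(pi, p_loop)
    log_a = [[math.log(A[i][j]) for j in range(n)] for i in range(n)]

    # Initialise with the speaker priors and the first frame.
    score = [math.log(pi[s]) + frame_loglik[0][s] for s in range(n)]
    backpointers = []
    for obs in frame_loglik[1:]:
        prev = score
        # Best predecessor for each current speaker j.
        ptr = [max(range(n), key=lambda i: prev[i] + log_a[i][j]) for j in range(n)]
        score = [prev[ptr[j]] + log_a[ptr[j]][j] + obs[j] for j in range(n)]
        backpointers.append(ptr)

    # Trace the best path backwards from the best final speaker.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    # Toy scores for two speakers; the third frame weakly favours speaker 1.
    loglik = [[0, -2], [0, -2], [-1, 0], [0, -2], [-2, 0], [-2, 0], [-2, 0]]
    print(viterbi(loglik, pi=[0.5, 0.5], p_loop=0.9))
    print(viterbi(loglik, pi=[0.5, 0.5], p_loop=0.0))
```

In this toy input, a high `p_loop` absorbs the isolated third frame into the surrounding speaker-0 segment, while `p_loop = 0.0` reduces the decoding to an independent per-frame decision. The sketch does not cover estimating the speaker models themselves, which is where the eigenvoice priors and VB inference enter.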
