Generalized Viterbi-based models for time-series segmentation and clustering applied to speaker diarization

Abstract: Speaker diarization is the task of partitioning a conversation among unknown speakers into speaker-homogeneous segments. State-of-the-art diarization systems are based on i-vector methodologies. However, these approaches require large quantities of training data, which must come from an environment similar to that of the conversation being diarized. In this paper we present a diarization system that does not require such training data and needs only some development data for parameter tuning. The system generalizes the well-known hidden Markov model (HMM) trained with Viterbi statistics, a popular clustering approach. Our proposed model, referred to as a hidden distortion model (HDM), is based on state distortion models and transition costs, neither of which has to be probabilistic, in contrast to the HMM. We provide a mathematical basis for our approach and demonstrate that the Viterbi-based HMM can be seen as a special case of HDM. Because the two models are so closely related, similar approaches to state-model training can be applied when the new paradigm is used to learn sequence dependencies. To evaluate the performance of HDM, we diarize two-speaker telephone conversations. On conversations from the LDC CALLHOME database, HDM outperforms a baseline HMM system by about 26% (relative improvement). On the NIST 2005 database, it yields a small improvement over the HMM system.
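To make the idea of state distortions and transition costs concrete, here is a minimal sketch of Viterbi-style decoding over arbitrary non-probabilistic costs. It is an illustration under stated assumptions, not the authors' implementation: the function name `min_cost_path`, the squared-distance distortion, the one-dimensional toy frames, the centroids, and the switch penalty are all hypothetical choices made for this example.

```python
def min_cost_path(distortion, transition, initial=None):
    """Find the state sequence with minimum total cost by dynamic programming.

    distortion[t][s]: cost of assigning frame t to state s (any real number).
    transition[p][s]: cost of moving from state p to state s.
    initial[s]:       optional cost of starting in state s (defaults to 0).
    None of these costs has to be a log-probability.
    """
    n_frames = len(distortion)
    n_states = len(distortion[0])
    if initial is None:
        initial = [0.0] * n_states

    # cost[s] holds the cheapest accumulated cost of any path ending in s.
    cost = [initial[s] + distortion[0][s] for s in range(n_states)]
    back = []  # back[t - 1][s] = best predecessor of state s at frame t

    for t in range(1, n_frames):
        new_cost, pointers = [], []
        for s in range(n_states):
            best_prev = min(range(n_states),
                            key=lambda p: cost[p] + transition[p][s])
            new_cost.append(cost[best_prev] + transition[best_prev][s]
                            + distortion[t][s])
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards from the best final state.
    state = min(range(n_states), key=lambda s: cost[s])
    total = cost[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, total


# Toy example (all values hypothetical): two "speakers" modelled by
# centroids, squared distance as the state distortion, and a fixed
# penalty for switching speakers as the transition cost.
frames = [0.1, 0.2, 0.6, 0.1, 1.0, 1.1]
centroids = [0.0, 1.0]
distortion = [[(x - c) ** 2 for c in centroids] for x in frames]

switch_penalty = 1.0
transition = [[0.0, switch_penalty],
              [switch_penalty, 0.0]]

path, total = min_cost_path(distortion, transition)
print(path, round(total, 2))
```

With the switch penalty set to zero, the decoder simply assigns each frame to its nearest centroid; a positive penalty discourages short, isolated speaker changes. If the distortions and transition costs were instead chosen as negative log emission and transition probabilities, the same recursion would reduce to ordinary Viterbi decoding of an HMM, which is one way to read the special-case relationship mentioned in the abstract.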
