Initialization of Iterative-Based Speaker Diarization Systems for Telephone Conversations

Speaker diarization systems attempt to assign temporal segments from a conversation between R speakers to the appropriate speaker r. This task is generally performed when no prior information about the speakers is available. The number of speakers is usually unknown and must be estimated, although in some applications it is known in advance. The diarization process generally consists of change detection, clustering, and labeling of a given audio stream. Speaker diarization can be performed using an iterative approach whose outcome depends on the selection of appropriate initial conditions. This study examines the influence of several common initialization algorithms, including two variants of a recently proposed K-means-based initialization algorithm, on the performance of an iterative speaker diarization system applied to two-speaker telephone conversations. The suggested speaker diarization system employs either self-organizing maps or Gaussian mixture models to model the speakers and the non-speech in the conversation. The diarization system and the initialization algorithms are tuned on a development set of 108 telephone conversations taken from the LDC CallHome corpus. The evaluation subset is composed of 2048 telephone conversations extracted from the NIST 2005 Speaker Recognition Evaluation. The results show that initializing the speaker diarization system with the K-means-based algorithms provides a relative improvement of 10.4% on the LDC development set and 12.2% on the NIST evaluation subset compared to random initialization after 12 iterations, the number required for the diarization process to converge under random initialization. With the K-means-based initialization, however, only five iterations are required for the system to converge. Thus, the new initialization improves performance in terms of both diarization error rate and speed of convergence.
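
To make the contrast between the two starting points concrete, the following is a minimal sketch of a random initialization and a plain K-means initialization for a two-speaker conversation. It is an illustration under stated assumptions, not the algorithm evaluated in the study: the abstract does not specify the two K-means-based variants, so ordinary K-means on per-segment mean feature vectors is used here as a stand-in. The function names (`random_init`, `kmeans_init`, `make_toy_segments`), the toy data, and all parameter values are hypothetical.

```python
import numpy as np


def random_init(n_segments, n_clusters=2, seed=0):
    """Baseline: assign every segment to a speaker uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_clusters, size=n_segments)


def kmeans_init(segments, n_clusters=2, n_iter=20, seed=0):
    """Assumed stand-in: plain K-means on the mean feature vector of each segment.

    segments: list of arrays, each of shape (n_frames, n_features).
    Returns one cluster label per segment.
    """
    rng = np.random.default_rng(seed)
    means = np.stack([seg.mean(axis=0) for seg in segments])
    # Start from the means of distinct, randomly chosen segments.
    start = rng.choice(len(means), size=n_clusters, replace=False)
    centroids = means[start].copy()
    labels = np.full(len(means), -1)
    for _ in range(n_iter):
        # Euclidean distance from every segment mean to every centroid.
        dists = np.linalg.norm(means[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = means[labels == k].mean(axis=0)
    return labels


def make_toy_segments(n_per_speaker=10, frames=50, dim=12, seed=1):
    """Synthetic two-speaker data: Gaussian features with different offsets."""
    rng = np.random.default_rng(seed)
    segments, truth = [], []
    for speaker, offset in enumerate((-3.0, 3.0)):
        for _ in range(n_per_speaker):
            segments.append(rng.normal(offset, 1.0, size=(frames, dim)))
            truth.append(speaker)
    return segments, np.array(truth)
```

In an iterative system of the kind described above, the labels returned by either function would only seed the first round of speaker-model training; the models and the segmentation would then be re-estimated in alternation until convergence. A starting partition that already separates the speakers reasonably well is one plausible reason fewer iterations are needed.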
