Statistical Utterance Comparison for Speaker Clustering Using Factor Analysis

We propose a novel method of measuring the similarity between two or more speech utterances for speaker clustering, based on probability theory and factor analysis. The similarity function is formulated as the probability that the utterances originated from the same speaker. It uses statistical eigenvoice and eigenchannel models to incorporate physical knowledge of interspeaker and intraspeaker variability, which makes the function trainable and robust. The function can be computed efficiently from a compact set of sufficient statistics for each utterance, so the acoustic features themselves can be discarded. We begin with eigenvoices only, and then show that incorporating eigenchannels yields an equation of identical form but with a different set of sufficient statistics. We test the proposed model on a speaker clustering task using the CALLHOME telephone conversation corpus and show that it performs better than two well-known similarity measures: the Cross-Likelihood Ratio (CLR) and the Generalized Likelihood Ratio (GLR).
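
To illustrate the idea of comparing utterances through compact sufficient statistics, the following is a minimal sketch of the GLR baseline named above, not of the proposed factor-analysis model. It assumes each utterance is modeled by a single full-covariance Gaussian fit by maximum likelihood, which is a simplification chosen for this example; the function names `suff_stats` and `glr_distance` are hypothetical.

```python
import numpy as np


def suff_stats(features):
    """Return (frame count, sum of frames, sum of outer products).

    `features` is a (frames x dims) array. After this call the raw
    frames are no longer needed for the comparison below.
    """
    x = np.asarray(features, dtype=float)
    return x.shape[0], x.sum(axis=0), x.T @ x


def _logdet_ml_cov(n, s1, s2):
    """Log-determinant of the maximum-likelihood covariance from the statistics."""
    mean = s1 / n
    cov = s2 / n - np.outer(mean, mean)
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def glr_distance(stats_a, stats_b):
    """Negative log GLR under the single-Gaussian assumption.

    Compares one Gaussian fit to the pooled data against separate
    Gaussians per utterance; larger values suggest different speakers.
    """
    na, a1, a2 = stats_a
    nb, b1, b2 = stats_b
    n = na + nb
    # Statistics are additive, so pooling needs no access to the frames.
    pooled = _logdet_ml_cov(n, a1 + b1, a2 + b2)
    return 0.5 * (
        n * pooled
        - na * _logdet_ml_cov(na, a1, a2)
        - nb * _logdet_ml_cov(nb, b1, b2)
    )
```

Under this assumption, two utterances drawn from the same distribution should give a small distance and two drawn from well-separated distributions a much larger one. Each utterance needs enough frames for its covariance estimate to be non-singular.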
