Generative model-based speaker clustering via mixture of von Mises-Fisher distributions

This paper proposes a generative model-based speaker clustering algorithm in the maximum a posteriori adapted Gaussian mixture model (GMM) mean supervector space. The algorithm can be viewed as an extension of the standard expectation maximization algorithm for fitting a mixture model to the data, which iterates between two steps - a sample re-assignment step (E-step) and a model re-estimation step (M-step) - until it converges. The directional scattering patterns of GMM mean supervectors suggest that we employ a mixture of von Mises-Fisher distributions in the model re-estimation step. In the sample re-assignment step, four sample-to-mixture assignment strategies, namely soft, hard, stochastic, and deterministic annealing assignments, are used. Our experiments on the GALE Mandarin dataset show that the use of a mixture of von Mises-Fisher distributions as the underlying model yields significantly higher speaker clustering accuracies than the use of a mixture of Gaussian distributions. It is further shown that deterministic annealing assignment outperforms soft assignment, that soft assignment is comparable to stochastic assignment, and that both soft and stochastic assignments outperform hard assignment.

[1]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[2]  Itshak Lapidot,et al.  Unsupervised speaker recognition based on competition between self-organizing maps , 2002, IEEE Trans. Neural Networks.

[3]  Inderjit S. Dhillon,et al.  Clustering on the Unit Hypersphere using von Mises-Fisher Distributions , 2005, J. Mach. Learn. Res..

[4]  K. Rose Deterministic annealing for clustering, compression, classification, regression, and related optimization problems , 1998, Proc. IEEE.

[5]  Ponani S. Gopalakrishnan,et al.  Clustering via the Bayesian information criterion with applications in speech recognition , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[6]  Douglas A. Reynolds,et al.  Speaker identification and verification using Gaussian mixture speaker models , 1995, Speech Commun..

[7]  Douglas A. Reynolds,et al.  Blind clustering of speech utterances based on speaker and language characteristics , 1998, ICSLP.

[8]  Douglas A. Reynolds,et al.  An overview of automatic speaker diarization systems , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  Jean-Luc Gauvain,et al.  Multistage speaker diarization of broadcast news , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[10]  Hsin-Min Wang,et al.  Speaker clustering of speech utterances using a voice characteristic reference space , 2004, INTERSPEECH.

[11]  Douglas E. Sturim,et al.  Support vector machines using GMM supervectors for speaker verification , 2006, IEEE Signal Processing Letters.

[12]  Radford M. Neal Pattern Recognition and Machine Learning , 2007, Technometrics.

[13]  David G. Stork,et al.  Pattern Classification , 1973 .

[14]  F. Kubala,et al.  Automatic Speaker Clustering , 1997 .

[15]  Kiyohiro Shikano,et al.  Unsupervised speaker adaptation for robust speech recognition in real environments , 2005 .

[16]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[17]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[18]  Jr. J.P. Campbell,et al.  Speaker recognition: a tutorial , 1997, Proc. IEEE.

[19]  G. Ruske,et al.  Robust speaker clustering in eigenspace , 2001, IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01..

[20]  Joydeep Ghosh,et al.  Under Consideration for Publication in Knowledge and Information Systems Generative Model-based Document Clustering: a Comparative Study , 2003 .

[21]  Herbert Gish,et al.  Clustering speakers by their voices , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[22]  Yi Liu,et al.  Recent advances in the IBM GALE Mandarin transcription system , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[23]  Douglas A. Reynolds,et al.  Speaker Verification Using Adapted Gaussian Mixture Models , 2000, Digit. Signal Process..

[24]  Marijn Huijbregts,et al.  The ICSI RT07s Speaker Diarization System , 2007, CLEAR.