Locality preserving speaker clustering

In this paper, we propose an efficient speaker clustering approach based on a locality preserving linear projection of the Gaussian mixture model (GMM) mean supervector space. Although the GMM mean supervector has proven to be an effective representation of speakers, its dimensionality is usually very high. The locality preserving projection (LPP) maps the high-dimensional GMM mean supervector space, in an unsupervised fashion, into a lower-dimensional subspace in which the local neighborhood structure of the data points is optimally preserved. Our speaker clustering experiments show that, in the reduced-dimensional LPP subspace, traditional clustering techniques such as k-means and hierarchical clustering perform significantly better than they do in the original high-dimensional GMM mean supervector space or in its principal component subspace.
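The projection-then-cluster idea can be illustrated with a minimal sketch of the standard LPP formulation (a k-nearest-neighbor graph with heat-kernel weights, followed by a generalized eigenproblem). This is not the paper's implementation: the function name `lpp`, the kernel-width heuristic, the ridge term `reg`, and the synthetic Gaussian data standing in for GMM mean supervectors are all assumptions made for illustration.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2


def lpp(X, n_components=2, n_neighbors=5, t=None, reg=1e-6):
    """Sketch of locality preserving projection.

    X has shape (n_samples, n_features). Returns the projection matrix A
    of shape (n_features, n_components) and the matching eigenvalues.
    """
    n, d = X.shape

    # Pairwise squared Euclidean distances; exclude self-matches.
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(D2, np.inf)

    # k-nearest-neighbor adjacency graph.
    idx = np.argsort(D2, axis=1)[:, :n_neighbors]
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = idx.ravel()

    # Heat-kernel weights; the width t defaults to the mean squared
    # neighbor distance (an assumed heuristic, not from the paper).
    if t is None:
        t = np.mean(D2[rows, cols])
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-D2[rows, cols] / t)
    W = np.maximum(W, W.T)  # symmetrize the graph

    # Graph Laplacian L = D - W.
    D = np.diag(W.sum(axis=1))
    L = D - W

    # Generalized eigenproblem X^T L X a = lambda X^T D X a.
    # A small ridge (assumption) keeps the right-hand matrix positive
    # definite when n_features exceeds n_samples.
    lhs = X.T @ L @ X
    rhs = X.T @ D @ X
    rhs = rhs + reg * (np.trace(rhs) / d) * np.eye(d)
    vals, vecs = eigh(lhs, rhs)

    # Eigenvectors with the smallest eigenvalues preserve locality best.
    return vecs[:, :n_components], vals[:n_components]


# Usage on synthetic data standing in for supervectors (hypothetical).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (30, 20)),
                    rng.normal(5.0, 1.0, (30, 20))])
A_demo, vals_demo = lpp(X_demo, n_components=3, n_neighbors=5)
Y_demo = X_demo @ A_demo                      # points in the LPP subspace
_, labels_demo = kmeans2(Y_demo, 2, minit="++")  # k-means in the subspace
```

Real GMM mean supervectors typically have far more dimensions than there are utterances, so the right-hand matrix is singular without some remedy; the ridge term here is one option, and a preliminary PCA step is another common choice.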
