Learning essential speaker sub-space using hetero-associative neural networks for speaker clustering

In this paper, we present a novel approach to speaker clustering that uses a hetero-associative neural network (HANN) to compute very low-dimensional speaker-discriminatory features (in our case, 1-dimensional) in a data-driven manner. A HANN trained to map the input feature space onto speaker labels through a bottleneck hidden layer is expected to learn a very low-dimensional feature subspace that essentially contains the speaker information. These low-dimensional features are then used in a simple k-means clustering algorithm to obtain the speaker segmentation. Evaluation of this approach on a database of real-life conversational speech from call centers shows that the clustering performance achieved is similar to that of state-of-the-art systems, even though our approach uses just 1-dimensional features. Augmenting the traditional mel-frequency cepstral coefficient (MFCC) features in the state-of-the-art system with these features resulted in improved clustering performance.
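The clustering stage described above can be illustrated with a minimal sketch of k-means on scalar features. This is an assumption-laden illustration, not the authors' implementation: the function name `kmeans_1d`, the evenly spaced initialization, and the sample values are all hypothetical, and the values merely stand in for the 1-dimensional bottleneck activations a trained HANN would produce (the network itself is not shown).

```python
def kmeans_1d(values, k=2, iters=100):
    """Lloyd's k-means on scalar features; returns (labels, centroids)."""
    if len(values) < k:
        raise ValueError("need at least k values")
    ordered = sorted(values)
    # Deterministic init (an assumption): spread the initial centroids
    # evenly over the sorted data, so k=2 starts from the min and max.
    if k == 1:
        centroids = [ordered[len(ordered) // 2]]
    else:
        centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                      for v in values]
        # Update step: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members
                                 else centroids[c])
        if new_labels == labels and new_centroids == centroids:
            break  # converged: nothing changed in this iteration
        labels, centroids = new_labels, new_centroids
    return labels, centroids


# Made-up 1-D features, one per speech segment (stand-ins for bottleneck
# outputs); with two speakers, the labels give a segment-to-speaker assignment.
features = [0.11, 0.09, 0.14, 0.88, 0.91, 0.12, 0.86]
labels, centroids = kmeans_1d(features, k=2)
print(labels, centroids)
```

Because the features are scalar, each cluster is simply an interval on the real line, which is why such a simple algorithm can suffice once the bottleneck has concentrated the speaker information into one dimension.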
