Speaker Detection and Clustering with SVM Technique in Persian Conversational Speech

The first stage in many speech processing applications (for example, multi-speaker speech source separation and speaker adaptation in automatic transcription of conversational speech) is speaker detection and clustering. In this paper, we describe speaker detection and clustering for natural Persian conversational speech. We implement our system using a support vector machine (SVM) classifier trained on Mel-frequency cepstral coefficients (MFCCs) and delta MFCCs. The classifier is trained on example signals from each class and is then used to scan the continuous multi-speaker speech signals in the FARSDAT database (a standard Persian speech database). The results indicate that this method achieves better separation quality than existing methods based on Gaussian mixture model (GMM) and vector quantization (VQ) classifiers.
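The pipeline described in the abstract (static plus delta features, an SVM trained per class, then a scan over continuous speech) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: random 4-dimensional Gaussian vectors stand in for real MFCC frames, only two speakers are modelled, the SVM is linear and trained by plain stochastic subgradient descent on the hinge loss, and the majority-vote smoothing window and all function names are hypothetical.

```python
import random

def add_deltas(frames):
    """Append first-order deltas (central difference, edges clamped) to each frame."""
    n = len(frames)
    out = []
    for t in range(n):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, n - 1)]
        out.append(list(frames[t]) + [(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=20, seed=0):
    """Stochastic subgradient descent on the L2-regularised hinge loss; y in {-1, +1}."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [wj * (1.0 - eta * lam) for wj in w]   # regulariser step
            if margin < 1.0:                            # hinge loss is active
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def predict(w, b, x):
    """Sign of the decision function: +1 for speaker A, -1 for speaker B."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def smooth(labels, half_width=2):
    """Majority vote in a sliding window to suppress isolated frame errors."""
    n = len(labels)
    out = []
    for t in range(n):
        window = labels[max(t - half_width, 0):min(t + half_width + 1, n)]
        out.append(1 if sum(window) >= 0 else -1)
    return out

def segments(labels):
    """Collapse a frame-label sequence into (start, end_exclusive, label) runs."""
    runs, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            runs.append((start, t, labels[start]))
            start = t
    return runs

def toy_frames(mean, n, rng, std=0.2):
    """Synthetic stand-in for MFCC frames of one speaker (assumption, not real audio)."""
    return [[rng.gauss(m, std) for m in mean] for _ in range(n)]

rng = random.Random(1)
MEAN_A = [1.0, -0.5, 0.3, 0.8]       # hypothetical "speaker A" feature centre
MEAN_B = [-m for m in MEAN_A]        # hypothetical "speaker B" feature centre

# Train on labelled example frames from each class.
train_X = add_deltas(toy_frames(MEAN_A, 100, rng)) + add_deltas(toy_frames(MEAN_B, 100, rng))
train_y = [1] * 100 + [-1] * 100
w, b = train_linear_svm(train_X, train_y)

# Scan a continuous A -> B -> A "conversation" frame by frame.
conversation = (toy_frames(MEAN_A, 30, rng) + toy_frames(MEAN_B, 30, rng)
                + toy_frames(MEAN_A, 30, rng))
frame_labels = smooth([predict(w, b, x) for x in add_deltas(conversation)])
speaker_segments = segments(frame_labels)
print(speaker_segments)
```

Each tuple in `speaker_segments` is a contiguous run of frames attributed to one speaker, so the run boundaries act as detected speaker changes and runs sharing a label form one cluster. A real system would replace `toy_frames` with MFCCs extracted from audio and would likely use a kernel SVM with a multi-class scheme when more than two speakers are present.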
