Unsupervised speaker change detection using SVM training misclassification rate

This work presents an unsupervised algorithm, based on support vector machines (SVMs), for detecting speaker changes (SCs) in a speech stream. The proposed algorithm is called the SVM training misclassification rate (STMR). Because the STMR needs less speech data to identify an SC, it can detect speaker segments of short duration. In experiments on the NIST Rich Transcription 2005 Spring Evaluation (RT-05S) corpus, the STMR achieves a missed detection rate of only 19.67 percent.
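
The abstract does not spell out the decision rule, so the following is only a minimal sketch of one plausible reading of the name: label the feature vectors of two adjacent windows as two classes, train an SVM on them, and treat a low training misclassification rate (the windows are easy to separate) as evidence of a speaker change. Everything specific here is an assumption for illustration: the linear Pegasos-style solver (a kernel SVM may be what the authors use), the hypothetical threshold of 0.25, the window size, and the synthetic 2-D Gaussian points standing in for real acoustic features.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Pegasos-style subgradient descent for a linear SVM without bias term."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w

def training_misclassification_rate(left, right):
    """Train on two adjacent windows (labels -1 / +1) and score the same data."""
    X = left + right
    y = [-1] * len(left) + [1] * len(right)
    dim = len(X[0])
    # Center on the pooled mean so a hyperplane through the origin is adequate
    # for equally sized windows (assumption made to keep the solver simple).
    mean = [sum(x[d] for x in X) / len(X) for d in range(dim)]
    Xc = [[x[d] - mean[d] for d in range(dim)] for x in X]
    w = train_linear_svm(Xc, y)
    errors = sum(1 for xi, yi in zip(Xc, y)
                 if yi * sum(wj * xj for wj, xj in zip(w, xi)) <= 0)
    return errors / len(X)

def is_speaker_change(left, right, threshold=0.25):
    """Hypothetical rule: well-separated windows suggest two speakers."""
    return training_misclassification_rate(left, right) < threshold

def synthetic_window(rng, center, n=100, dim=2):
    """Stand-in for a window of acoustic feature vectors (not real speech)."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]

rng = random.Random(42)
same = training_misclassification_rate(synthetic_window(rng, 0.0),
                                       synthetic_window(rng, 0.0))
diff = training_misclassification_rate(synthetic_window(rng, 0.0),
                                       synthetic_window(rng, 5.0))
print("same-distribution windows:", same)
print("different-distribution windows:", diff)
```

In this toy setting, windows drawn from the same distribution cannot be separated much better than chance, while windows drawn from clearly different distributions are separated almost perfectly. A real system would replace the synthetic points with features extracted from speech and would need the threshold tuned on held-out data.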
