The TNO Speaker Diarization System for NIST RT05s Meeting Data

The TNO speaker diarization system is based on a standard BIC segmentation and clustering algorithm. Because correct speech detection appears to be essential for the NIST Rich Transcription speaker diarization evaluation measure, we have also developed a speech activity detector (SAD). The SAD decodes the speech signal using two Gaussian Mixture Models, one trained on silence and one on speech. It was trained only on AMI development test data and performed quite well in the evaluation across all 5 meeting locations, with a SAD error rate of 5.0 %. For the speaker clustering algorithm we optimized the BIC penalty parameter λ to 14, which is quite high compared with the theoretical value of 1. The final speaker diarization error rate was evaluated at 35.1 %.
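
To illustrate the role of the penalty parameter λ, the following is a minimal sketch of a ΔBIC test between two segments. It assumes the common formulation with a single full-covariance Gaussian per segment; the abstract does not say which variant the system uses. The function name `delta_bic` and its arguments are hypothetical and chosen for illustration only.

```python
import numpy as np


def delta_bic(x, y, lam=1.0):
    """Delta-BIC for modelling x and y with two Gaussians versus one.

    x, y: feature arrays of shape (n_frames, n_dims).
    lam:  penalty weight (lambda); 1.0 is the theoretical value.
    Assumed convention: a positive value favours keeping the segments
    apart, a negative value favours merging them.
    """
    z = np.vstack([x, y])
    n_x, n_y, n = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet_cov(a):
        # Maximum-likelihood (biased) covariance estimate of the frames.
        cov = np.atleast_2d(np.cov(a, rowvar=False, bias=True))
        _, logdet = np.linalg.slogdet(cov)
        return logdet

    # Log-likelihood gain from using two Gaussians instead of one.
    gain = 0.5 * (n * logdet_cov(z)
                  - n_x * logdet_cov(x)
                  - n_y * logdet_cov(y))

    # Extra free parameters of the second Gaussian: mean plus covariance.
    n_params = d + d * (d + 1) / 2
    penalty = 0.5 * lam * n_params * np.log(n)
    return gain - penalty
```

Under this convention, raising λ from 1 to 14 enlarges the penalty term, so more segment pairs fall below zero and are merged, which yields fewer clusters.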
