Robust Speaker Diarization for Meetings: ICSI RT06S Meetings Evaluation System

In this paper we present the ICSI speaker diarization system submitted to the NIST Rich Transcription evaluation (RT06s) [1], conducted in the meetings environment. The system builds on our RT05s system, which uses agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure both to decide which pair of clusters to merge and to determine when to stop merging. In this year's system we have eliminated any remaining need for training data, thereby increasing robustness. Our primary system introduces several improvements over last year's system. First, we use a new training-free speech/non-speech detection algorithm. Second, we introduce a new algorithm for system initialization. Third, we use a frame purification algorithm to increase cluster discriminability. Finally, we use inter-channel delays as features. We explain each of these improvements and report our system's results on the official evaluation data, scored against both hand-aligned references and forced alignments. We also analyze some of the results and propose further improvements.
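As an illustration of the BIC-driven merge and stop decisions mentioned above, the following is a minimal sketch and not the ICSI implementation. It assumes one-dimensional features, a single Gaussian per cluster, and the standard penalized delta-BIC (the system described here uses a modified BIC measure instead). The names `gaussian_loglik`, `delta_bic`, `agglomerate`, and `penalty_weight` are hypothetical, as is the toy data.

```python
import math


def gaussian_loglik(xs):
    """Log-likelihood of 1-D samples under their own maximum-likelihood Gaussian."""
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-6)  # floor avoids log(0)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1.0)


def delta_bic(a, b, penalty_weight=1.0):
    """Standard delta-BIC for merging clusters a and b (assumption: not the
    modified BIC of the paper). Positive values favor merging."""
    n = len(a) + len(b)
    # Two separate Gaussians have 2 more free parameters than one merged Gaussian.
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gaussian_loglik(a + b) - gaussian_loglik(a) - gaussian_loglik(b) + penalty


def agglomerate(clusters, penalty_weight=1.0):
    """Repeatedly merge the best-scoring pair; stop when no pair scores above zero."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best_score, (i, j) = max(
            (delta_bic(clusters[i], clusters[j], penalty_weight), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best_score <= 0:
            break  # stopping criterion: no merge improves the BIC
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data (hypothetical): four initial clusters drawn from two "speakers".
a1 = [-0.5, -0.2, 0.1, 0.3, 0.6]
a2 = [-0.4, -0.1, 0.0, 0.2, 0.5]
b1 = [9.5, 9.8, 10.1, 10.3, 10.6]
b2 = [9.6, 9.9, 10.0, 10.2, 10.5]
result = agglomerate([a1, b1, a2, b2])
print(len(result), [len(c) for c in result])
```

On this toy input the same-speaker pairs score above zero and are merged, while any cross-speaker merge scores well below zero, so the loop stops on its own without a preset number of speakers.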

[1] Jitendra Ajmera et al. A robust speaker clustering algorithm. IEEE Workshop on Automatic Speech Recognition and Understanding, 2003.

[2] Climent Nadeu et al. Hybrid Speech/non-speech detector applied to Speaker Diarization of Meetings. IEEE Odyssey - The Speaker and Language Recognition Workshop, 2006.

[3] X. Anguera et al. Speaker diarization for multi-party meetings using acoustic fusion. IEEE Workshop on Automatic Speech Recognition and Understanding, 2005.

[4] Xavier Anguera Miró et al. Speaker diarization for multiple distant microphone meetings: mixing acoustic features and inter-channel time differences. INTERSPEECH, 2006.

[5] Andreas Stolcke et al. Further Progress in Meeting Recognition: The ICSI-SRI Spring 2005 Speech-to-Text Evaluation System. MLMI, 2005.

[6] S. Chen et al. Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion. 1998.

[7] Andreas Stolcke et al. The ICSI-SRI Spring 2006 Meeting Recognition System. MLMI, 2006.

[8] Xavier Anguera Miró et al. Robust Speaker Segmentation for Meetings: The ICSI-SRI Spring 2005 Diarization System. MLMI, 2005.

[9] D. A. van Leeuwen. The (TNO) Speaker Diarization System for NIST Rich Transcription Evaluation 2005 for meeting data. 2005.

[10] Xavier Anguera Miró et al. Purity Algorithms for Speaker Diarization of Meetings Data. IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2006.

[11] Xavier Anguera Miró et al. Friends and enemies: a novel initialization for speaker diarization. INTERSPEECH, 2006.

[12] David A. van Leeuwen et al. The TNO Speaker Diarization System for NIST RT05s Meeting Data. MLMI, 2005.

[13] Xavier Anguera Miró et al. Automatic Cluster Complexity and Quantity Selection: Towards Robust Speaker Diarization. MLMI, 2006.

[14] Dan Istrate et al. NIST RT'05S Evaluation: Pre-processing Techniques and Speaker Diarization on Multiple Microphone Meetings. MLMI, 2005.