Large-Scale Speaker Diarization for Long Recordings and Small Collections

Speaker diarization of very long recordings is a problem for most diarization systems that are based on agglomerative clustering with a hidden Markov model (HMM) topology. Collection-wide speaker diarization, where each speaker is identified uniquely across the entire collection, is an even more challenging task. In this paper we propose a method that makes it possible to perform diarization of long recordings efficiently. We have also applied this method successfully to a collection with a total duration of approximately 15 hours. The method first segments long recordings into smaller chunks, on which diarization is performed. Next, a speaker detection system is used to link the speech clusters from each chunk and to assign a unique label to each speaker in the long recording or in the small collection. We show for three different audio collections that it is possible to perform high-quality diarization with this approach. The long meetings from the ICSI corpus are processed 5.5 times faster than originally needed. In addition, uniquely labeling each speaker across the entire collection makes it possible to perform speaker-based information retrieval with high accuracy (mean average precision of 0.57).
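The two-stage idea described above (chunk the recording, then link per-chunk clusters into global speakers) can be illustrated with a minimal sketch. This is not the authors' system: the function names `split_into_chunks` and `link_clusters`, the pairwise `score` callback, and the simple threshold-plus-transitive-merge linking rule are all assumptions made for illustration. The abstract does not specify the chunk length, the scoring model of the speaker detection system, or the linking strategy.

```python
import math
from itertools import combinations


def split_into_chunks(duration, chunk_len):
    """Return (start, end) times that cover [0, duration] in chunk_len steps.

    Hypothetical helper: the chunk length used in the paper is not given
    in the abstract.
    """
    n_chunks = math.ceil(duration / chunk_len)
    return [
        (i * chunk_len, min((i + 1) * chunk_len, duration))
        for i in range(n_chunks)
    ]


def link_clusters(cluster_ids, score, threshold):
    """Give every per-chunk cluster a global speaker label.

    Assumption: two clusters belong to the same speaker when
    score(a, b) >= threshold, and links are merged transitively with a
    union-find structure. `score` stands in for a speaker detection
    system and is supplied by the caller.
    """
    parent = {c: c for c in cluster_ids}

    def find(c):
        # Follow parent pointers to the root, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in combinations(cluster_ids, 2):
        if score(a, b) >= threshold:
            parent[find(a)] = find(b)

    # Map each union-find root to a readable global label.
    root_to_label = {}
    labels = {}
    for c in cluster_ids:
        root = find(c)
        if root not in root_to_label:
            root_to_label[root] = f"spk{len(root_to_label)}"
        labels[c] = root_to_label[root]
    return labels


# Toy usage: clusters are (chunk, local_id) pairs with a made-up 1-D
# "voice" value; the score is the negative distance between two values.
toy_voice = {
    ("chunk0", "A"): 1.0,
    ("chunk0", "B"): 5.0,
    ("chunk1", "A"): 5.1,
    ("chunk1", "B"): 0.9,
}
global_labels = link_clusters(
    list(toy_voice),
    score=lambda a, b: -abs(toy_voice[a] - toy_voice[b]),
    threshold=-0.5,
)
```

In this toy setup, the local labels "A" and "B" mean different speakers in each chunk, which is exactly why a linking step is needed before speakers can be labeled uniquely across a long recording or a collection.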
