Tuning-Robust Initialization Methods for Speaker Diarization

This paper investigates how robust a typical speaker diarization system is against variation in its initialization parameters and presents a method that significantly reduces the manual tuning of these values. The behavior of an agglomerative hierarchical clustering system is studied to determine which initialization parameters have the greatest impact on accuracy. We show that the accuracy of typical systems is indeed very sensitive to the values chosen for the initialization parameters and to factors such as the duration of speech in the recording. We then present a solution that reduces the sensitivity to the initialization values, and therefore the need for manual tuning, while at the same time increasing the accuracy of the system. For short meetings extracted from the previous (2006, 2007, and 2009) National Institute of Standards and Technology (NIST) Rich Transcription (RT) evaluation data, the diarization error rate decreases by up to 50% relative. The approach consists of a novel initialization parameter estimation method for speaker diarization systems that use agglomerative clustering with the Bayesian information criterion (BIC) and Gaussian mixture models (GMMs) of frame-based cepstral features (MFCCs). The estimation method balances the relationship between the optimal number of seconds of speech data per Gaussian and the duration of the speech data, and it is combined with a novel nonuniform initialization method. The resulting system performs better than the current ICSI baseline engine on datasets from the NIST RT evaluations of 2006, 2007, and 2009.
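To make the quantity "seconds of speech data per Gaussian" concrete, the following is a minimal sketch and not the estimation method of the paper. It assumes that the quantity is the speech duration divided by the total number of Gaussians at initialization (initial clusters times Gaussians per cluster). The function names, the floor rounding, and the lower bound of two clusters are illustrative assumptions.

```python
import math


def seconds_per_gaussian(speech_seconds: float,
                         num_clusters: int,
                         gaussians_per_cluster: int) -> float:
    """Seconds of speech available per Gaussian at initialization.

    Assumed definition: speech duration / (clusters * Gaussians per cluster).
    """
    total_gaussians = num_clusters * gaussians_per_cluster
    if total_gaussians <= 0:
        raise ValueError("need at least one Gaussian")
    return speech_seconds / total_gaussians


def initial_clusters_for(speech_seconds: float,
                         target_seconds_per_gaussian: float,
                         gaussians_per_cluster: int,
                         min_clusters: int = 2) -> int:
    """Pick the number of initial clusters that roughly meets a target
    seconds-per-Gaussian value for a recording of the given speech duration.

    The floor rounding and the min_clusters bound are hypothetical choices.
    """
    if target_seconds_per_gaussian <= 0 or gaussians_per_cluster <= 0:
        raise ValueError("target and Gaussians per cluster must be positive")
    total_gaussians = speech_seconds / target_seconds_per_gaussian
    clusters = math.floor(total_gaussians / gaussians_per_cluster)
    return max(min_clusters, clusters)


if __name__ == "__main__":
    # A 10-minute recording with 16 initial clusters of 5 Gaussians each.
    print(seconds_per_gaussian(600.0, 16, 5))
    # The same target applied to a much shorter recording yields fewer clusters.
    print(initial_clusters_for(100.0, 7.5, 5))
```

Under this assumed definition, a fixed number of initial clusters and Gaussians gives each Gaussian less data as recordings get shorter, which is one way to see why a setting tuned on long meetings may not transfer to short ones.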
