Linguistic influences on bottom-up and top-down clustering for speaker diarization

While bottom-up approaches have emerged as the standard, default approach to clustering for speaker diarization, we have always found that the top-down approach gives equivalent or superior performance. Our recent work shows that significant gains in performance can be obtained when cluster purification is applied to the output of top-down systems, but that purification can degrade performance when applied to the output of bottom-up systems. This paper demonstrates that these observations can be accounted for by factors unrelated to the speaker, and that such factors can affect the performance of bottom-up clustering strategies more strongly than that of top-down strategies. Experimental results confirm that clusters produced through top-down clustering are better normalized against phone variation than those produced through bottom-up clustering, and that this accounts for the observed inconsistencies in purification performance. The work highlights the need for marginalization strategies that encourage convergence toward different speakers rather than toward nuisance factors such as those related to the linguistic content.
