A Comparative Study of Bottom-Up and Top-Down Approaches to Speaker Diarization

This paper presents a theoretical framework for analyzing the relative merits of the two dominant approaches to speaker diarization: bottom-up and top-down hierarchical clustering. We present an original qualitative comparison which argues that the two approaches are likely to behave differently in speaker inventory optimization and model training. Bottom-up approaches will capture comparatively purer models and will thus be more sensitive to nuisance variation, such as that related to the speech content. Top-down approaches, in contrast, will produce less discriminative speaker models but, importantly, models which are potentially better normalized against nuisance variation. We report experiments conducted on two standard, single-channel NIST RT evaluation datasets which validate our hypotheses. Results show that competitive performance can be achieved with both bottom-up and top-down approaches (average diarization error rates, DERs, of 21% and 22%), and that neither approach is superior. Speaker purification, which aims to improve speaker discrimination, gives more consistent improvements with the top-down system than with the bottom-up system (average DERs of 19% and 25%), thereby confirming that the top-down system is less discriminative and that the bottom-up system is less stable. Finally, we report a new combination strategy that exploits the merits of the two approaches. Combination delivers an average DER of 17% and confirms the intrinsic complementarity of the two approaches.
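Since every result above is reported as a DER, a minimal sketch of how such a score can be computed may help. It assumes a simplified frame-level setting that the abstract does not specify: one label per frame, `None` for non-speech, no overlapped speech and no forgiveness collar. The function name `frame_der` and the toy labels are hypothetical, and the official NIST scoring tool differs in these details.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Frame-level DER: (missed + false alarm + confusion) / reference speech.

    Both inputs are equal-length sequences of speaker labels, with None
    marking non-speech frames. Simplified sketch, not the NIST scorer.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same number of frames")
    pairs = list(zip(reference, hypothesis))
    ref_speech = sum(r is not None for r, _ in pairs)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    # Count co-occurring frames for every (reference, hypothesis) speaker pair.
    overlap = {}
    for r, h in pairs:
        if r is not None and h is not None:
            overlap[(r, h)] = overlap.get((r, h), 0) + 1
    both_speech = sum(overlap.values())

    ref_spk = list({r for r in reference if r is not None})
    hyp_spk = list({h for h in hypothesis if h is not None})

    # Brute-force the best one-to-one speaker mapping; pad the shorter
    # list with None so unmatched speakers simply contribute zero frames.
    n = max(len(ref_spk), len(hyp_spk))
    ref_spk += [None] * (n - len(ref_spk))
    hyp_spk += [None] * (n - len(hyp_spk))
    best_match = 0
    for perm in permutations(hyp_spk):
        matched = sum(overlap.get((r, h), 0) for r, h in zip(ref_spk, perm))
        best_match = max(best_match, matched)

    confusion = both_speech - best_match
    return (missed + false_alarm + confusion) / ref_speech


# Toy example: one confused frame and one false-alarm frame over 5 speech frames.
reference = ["A", "A", "A", "B", "B", None]
hypothesis = ["x", "x", "y", "y", "y", "y"]
print(frame_der(reference, hypothesis))
```

The exhaustive search over permutations is only practical for a handful of speakers; a real scorer would use an assignment algorithm for the mapping step.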
