Overlap detection for speaker diarization by fusing spectral and spatial features

A substantial portion of the errors made by conventional speaker diarization systems on meeting data can be attributed to overlapped speech. This paper proposes the use of several spatial features to improve speech overlap detection on distant-channel microphones. These spatial features are integrated into a spectral-based system using principal component analysis and neural networks. Different overlap detection hypotheses are used to improve diarization performance with both overlap exclusion and overlap labeling. In experiments conducted on the AMI Meeting Corpus, we demonstrate relative improvements in diarization error rate (DER) of 11.6% and 14.6% for single- and multi-site data, respectively.
