Efficient use of overlap information in speaker diarization

Speaker overlap in meetings is thought to be a significant contributor to error in speaker diarization, but it is not clear whether overlaps are problematic for speaker clustering, whether the errors could be addressed by assigning multiple labels in overlap regions, or both. In this paper, we examine these issues experimentally, assuming perfect detection of overlaps, to assess the relative importance of the two problems and the potential impact of overlap detection. With our best features, we find that detecting overlaps could improve diarization accuracy by 15% relative, using a simple strategy of assigning speaker labels in overlap regions according to the labels of the neighboring segments. In addition, combining cross-correlation features with MFCCs reduces the performance gap due to overlaps, so that there is little gain from removing overlapped regions before clustering.
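The neighbor-based labeling strategy mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out the exact rule, so the code assumes that each overlap region keeps the speaker hypothesized at its midpoint and gains one extra label from the temporally nearest segment of a different speaker. The names `label_overlaps`, `Segment`, and `Region` are hypothetical, and the overlap regions are assumed to come from an oracle detector, as in the paper's experimental setup.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start, end, speaker) from a single-label diarizer
Region = Tuple[float, float]        # (start, end) of an oracle-detected overlap


def label_overlaps(
    segments: List[Segment], overlaps: List[Region]
) -> List[Tuple[float, float, Tuple[str, ...]]]:
    """Assign up to two speaker labels to each overlap region.

    Assumed rule (an illustration, not taken from the paper): keep the
    speaker active at the region midpoint, then add the speaker of the
    nearest segment that belongs to a different speaker.
    """
    segments = sorted(segments)
    results = []
    for o_start, o_end in overlaps:
        mid = 0.5 * (o_start + o_end)
        # Speaker the single-label system hypothesized at the midpoint, if any.
        current = next((s for s in segments if s[0] <= mid < s[1]), None)
        labels = [current[2]] if current is not None else []

        def gap(seg: Segment) -> float:
            # Temporal distance between a segment and the overlap region
            # (zero when they touch or intersect).
            return max(seg[0] - o_end, o_start - seg[1], 0.0)

        # Nearest neighboring segment with a different speaker label.
        candidates = sorted((s for s in segments if s[2] not in labels), key=gap)
        if candidates:
            labels.append(candidates[0][2])
        results.append((o_start, o_end, tuple(labels)))
    return results


if __name__ == "__main__":
    hyp = [(0.0, 5.0, "A"), (5.0, 9.0, "B"), (9.0, 12.0, "C")]
    for region in label_overlaps(hyp, [(4.0, 5.0), (8.5, 9.5)]):
        print(region)
```

In this toy example the region from 4.0 to 5.0 s lies at the end of speaker A's turn and borders speaker B's, so it receives both labels. A real system would score the result with a diarization error metric that credits multiple speakers in overlapped speech.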
