Impact of overlapping speech detection on speaker diarization for broadcast news and debates

This paper describes the overlapping speech detection systems developed by Orange and LIMSI for the ETAPE evaluation campaign on French broadcast news and debates. Using either cepstral features or a multi-pitch analysis, an F1-measure of up to 59.2% for overlapping speech detection is reported on the TV data of the ETAPE evaluation set, where 6.7% of the speech was measured as overlapping, ranging from 1.2% in the news to 10.4% in the debates. Overlapping speech segments were excluded during the speaker diarization stage and then labelled with the two nearest speaker labels, taking temporal distance into account. We describe the effects of this strategy for various overlapping speech detection systems and show that it improves the diarization error rate in all situations, by up to 26.1% relative in our best configuration.
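The relabelling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how temporal distance is measured, so the sketch assumes it is the gap in seconds between a detected overlap interval and the closest segment of each speaker. The names `Segment`, `temporal_distance`, `two_nearest_speakers` and `label_overlaps` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One single-speaker segment produced by diarization (times in seconds)."""
    start: float
    end: float
    speaker: str


def temporal_distance(seg, start, end):
    """Gap between a segment and the interval [start, end]; 0 if they touch or overlap.

    Assumption: the source only says temporal distance is taken into account,
    so this gap-based definition is illustrative.
    """
    return max(0.0, seg.start - end, start - seg.end)


def two_nearest_speakers(diarization, start, end):
    """Return up to two speaker labels whose segments lie closest to [start, end]."""
    best = {}
    for seg in diarization:
        d = temporal_distance(seg, start, end)
        if seg.speaker not in best or d < best[seg.speaker]:
            best[seg.speaker] = d
    # Sort by distance, breaking ties on the label so the result is deterministic.
    ranked = sorted(best, key=lambda spk: (best[spk], spk))
    return ranked[:2]


def label_overlaps(diarization, overlaps):
    """Attach two speaker labels to each detected overlap interval."""
    return [(s, e, two_nearest_speakers(diarization, s, e)) for s, e in overlaps]


# Diarization output computed with the overlap regions excluded.
diarization = [
    Segment(0.0, 4.0, "A"),
    Segment(5.0, 9.0, "B"),
    Segment(12.0, 15.0, "C"),
]
# Intervals flagged by a (hypothetical) overlapping speech detector.
overlaps = [(4.0, 5.0), (10.0, 11.0)]
labelled = label_overlaps(diarization, overlaps)
```

In this toy example the overlap at 4.0 to 5.0 s is assigned to speakers A and B, which border it, and the one at 10.0 to 11.0 s to B and C, each one second away.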
