Speech overlap detection in a two-pass speaker diarization system

In this paper we present the two-pass speaker diarization system that we developed for the NIST RT09s evaluation. In the first pass of our system a model for speech overlap detection is generated automatically. This model is used in two ways to reduce the diarization errors due to overlapping speech. First, it is used in a second diarization pass to remove overlapping speech from the data while training the speaker models. Second, it is used to find speech overlap for the final segmentation so that overlapping speech segments can be generated. The experiments show that our overlap detection method improves the performance of all three of our system configurations. Index Terms: Speaker diarization, speech overlap detection, Benchmark

[1]  Marijn Huijbregts,et al.  The blame game: performance analysis of speaker diarization system components , 2007, INTERSPEECH.

[2]  Hynek Hermansky,et al.  Qualcomm-ICSI-OGI features for ASR , 2002, INTERSPEECH.

[3]  David A. van Leeuwen,et al.  The AMI Speaker Diarization System for NIST RT06s Meeting Data , 2006, MLMI.

[4]  Xavier Anguera Miró ROBUST SPEAKER DIARIZATION FOR MEETINGS , 2006 .

[5]  Marijn Huijbregts,et al.  The ICSI RT07s Speaker Diarization System , 2007, CLEAR.

[6]  Hervé Bourlard,et al.  Unknown-multiple speaker clustering using HMM , 2002, INTERSPEECH.

[7]  Xavier Anguera Miró,et al.  Speaker diarization for multiple distant microphone meetings: mixing acoustic features and inter-channel time differences , 2006, INTERSPEECH.

[8]  José Manuel Pardo,et al.  Robust Speaker Diarization for meetings , 2006 .

[9]  Douglas A. Reynolds,et al.  Approaches and applications of audio diarization , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[10]  Gerald Friedland,et al.  Overlapped speech detection for improved speaker diarization in multiparty meetings , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Roeland Ordelman,et al.  Filtering the unknown: speech activity detection in heterogeneous video collections , 2007, INTERSPEECH.

[12]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[13]  Jonathan G. Fiscus,et al.  The Rich Transcription 2007 Meeting Recognition Evaluation , 2007, CLEAR.