New implementations of the E-HMM-based system for speaker diarization in meeting rooms

This paper addresses the problem of speaker diarization in the specific context of meeting room recordings. We report several enhancements to the E-HMM-based speaker diarization system. These involve a different approach to speaker modelling, which uses EM/ML-based training rather than the MAP adaptation of our previous work. Using the new system, we investigate the effects of speech activity detection through speaker diarization experiments conducted on 23 meetings extracted from the NIST/RT evaluation campaign datasets. We propose a new approach that assigns confidence values according to the type of information carried by the signal and incorporates these values directly into the speaker diarization system. Experimental results show that, perhaps surprisingly, non-speech segments do not systematically affect the robustness of the speaker diarization system or, more precisely, of the speaker model training process.
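
As an illustration only, one plausible way to feed per-frame confidence values into EM/ML speaker model training is to scale each frame's component posteriors by its confidence before the M-step. The abstract does not specify how the confidences are actually used, so this weighting scheme, the diagonal-covariance GMM, and the name `weighted_em_step` are all assumptions made for this sketch, not the system's implementation.

```python
import numpy as np

def weighted_em_step(x, conf, weights, means, variances):
    """One EM iteration for a diagonal-covariance GMM with per-frame confidences.

    Hypothetical sketch: x is (T, D) feature frames, conf is (T,) with values
    in [0, 1], weights is (K,), means and variances are (K, D).
    """
    # E-step: log-likelihood of every frame under every Gaussian component.
    diff = x[:, None, :] - means[None, :, :]                      # (T, K, D)
    log_gauss = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=2
    )                                                             # (T, K)
    log_post = np.log(weights) + log_gauss
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    resp = np.exp(log_post)                                       # rows sum to 1

    # Assumed weighting: a frame with confidence 0 (e.g. judged non-speech)
    # contributes nothing, a frame with confidence 1 contributes fully.
    resp = resp * conf[:, None]

    # M-step: maximum-likelihood re-estimation from the weighted statistics.
    occ = resp.sum(axis=0) + 1e-10                                # (K,) occupancy
    new_weights = occ / occ.sum()
    new_means = (resp.T @ x) / occ[:, None]
    new_vars = (resp.T @ (x ** 2)) / occ[:, None] - new_means ** 2
    return new_weights, new_means, np.maximum(new_vars, 1e-6)     # variance floor
```

Under this assumed scheme, a hard speech activity decision is the special case where every confidence is either 0 or 1, so discarding non-speech frames and giving them zero confidence yield the same model update.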