Fusion of Acoustic and Prosodic Features for Speaker Clustering

This work focuses on speaker clustering methods used in speaker diarization systems. The purpose of speaker clustering is to group together segments that belong to the same speaker. It is usually applied in the last stage of the speaker diarization process. We concentrate on developing suitable representations of speaker segments for clustering and explore different similarity measures for joining speaker segments. We implement two competing systems. The first is a standard approach that uses bottom-up agglomerative clustering with the Bayesian Information Criterion (BIC) as the merging criterion. The second is a fusion speaker clustering system in which speaker segments are modeled by both acoustic and prosodic representations. The idea is to model the speakers' prosodic characteristics and add them to the basic acoustic information estimated from the speaker segments. We construct 10 basic prosodic features derived from the energy of the audio signal, the estimated pitch contours, and the recognized voiced and unvoiced regions of speech. In this way we incorporate higher-level information into the representations of the speaker segments, which leads to improved clustering when speakers have similar acoustic characteristics or when acoustic conditions are poor.
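The baseline described above, bottom-up agglomerative clustering with BIC as the merging criterion, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes a common variant: each segment is modeled by a single full-covariance Gaussian, and two clusters are merged when the ΔBIC between the two-model and one-model hypotheses is negative. The function names `delta_bic` and `agglomerate`, the penalty weight `lam`, and the small covariance ridge are illustrative assumptions, not details taken from this work.

```python
import numpy as np


def _logdet_cov(x):
    """Log-determinant of the maximum-likelihood covariance of the frames in x."""
    # Small ridge (an assumption of this sketch) keeps the matrix well conditioned.
    cov = np.cov(x, rowvar=False, bias=True) + 1e-6 * np.eye(x.shape[1])
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(x, y, lam=1.0):
    """Delta BIC between modeling x and y separately versus jointly.

    x, y: arrays of shape (n_frames, n_features).
    Negative values favor a single model, i.e. merging the two segments.
    """
    n1, d = x.shape
    n2 = y.shape[0]
    n = n1 + n2
    z = np.vstack([x, y])
    # Penalty for the extra parameters of a second Gaussian (mean + covariance).
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    gain = 0.5 * (n * _logdet_cov(z) - n1 * _logdet_cov(x) - n2 * _logdet_cov(y))
    return gain - penalty


def agglomerate(segments, lam=1.0):
    """Greedy bottom-up clustering: merge the pair with the lowest delta BIC
    until no pair has a negative score. Returns lists of segment indices."""
    clusters = [np.asarray(s, dtype=float) for s in segments]
    labels = [[i] for i in range(len(clusters))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(clusters[i], clusters[j], lam)
                if best is None or score < best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score >= 0:
            break  # no remaining pair is better modeled jointly
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        labels[i] += labels[j]
        del clusters[j]
        del labels[j]
    return labels
```

In a fusion system of the kind described, a score like this one, computed on acoustic features, would be combined with a similarity computed on the prosodic features; how that combination is done is not specified in the abstract and is not shown here.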
