An Information Theoretic Combination of MFCC and TDOA Features for Speaker Diarization

This correspondence describes a novel system for speaker diarization of meeting recordings based on the combination of acoustic features (MFCC) and time delay of arrival (TDOA) features. The first part of the paper analyzes the differences between MFCC and TDOA features, which possess completely different statistical properties. When Gaussian mixture models (GMMs) are used, experiments reveal that the diarization system is sensitive to the recording scenario (i.e., meeting rooms with a varying number of microphones). In the second part, a new multistream diarization system is proposed, extending previous work on information theoretic diarization. Both the speaker clustering and speaker realignment steps are discussed; in contrast to current systems, the proposed method avoids performing the feature combination by averaging log-likelihood scores. Experiments on meeting data reveal that the proposed approach outperforms the GMM-based system when the recordings are made with a varying number of microphones.
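For orientation, the log-likelihood averaging used by the conventional systems that the abstract contrasts against can be sketched as below. This is a minimal illustration, not the authors' implementation: each stream is scored with a single diagonal-covariance Gaussian rather than a full GMM, and the function names and the weight `w_mfcc=0.9` are hypothetical choices made for the example.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def combined_loglik(mfcc, tdoa, mfcc_model, tdoa_model, w_mfcc=0.9):
    """Weighted average of the per-stream log-likelihoods.

    Each model is a (mean, var) pair; w_mfcc is an assumed stream weight,
    and the TDOA stream receives the remaining 1 - w_mfcc.
    """
    ll_mfcc = gaussian_loglik(mfcc, *mfcc_model)
    ll_tdoa = gaussian_loglik(tdoa, *tdoa_model)
    return w_mfcc * ll_mfcc + (1.0 - w_mfcc) * ll_tdoa
```

Because the two streams have very different statistics, a fixed weight of this kind has to balance scores on different scales, which is the step the proposed information theoretic combination avoids.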
