A DOA Based Speaker Diarization System for Real Meetings

This paper presents a speaker diarization system that estimates who spoke when in a meeting. The proposed system combines a noise-robust voice activity detector (VAD), a direction of arrival (DOA) estimator, and a DOA classifier. Our previous system estimated the DOA with the generalized cross-correlation method with the phase transform (GCC-PHAT). Because GCC-PHAT yields only one DOA per frame, it had difficulty handling overlapping speakers. This paper addresses that limitation by estimating a DOA at each time-frequency slot (TFDOA), and reports how this improves diarization performance for real meetings and conversations recorded in a room with a reverberation time of 350 ms.
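To make the contrast between the two estimators concrete, the following is a minimal sketch, assuming a two-microphone far-field model. The function names (`gcc_phat_delay`, `tf_delays`, `delay_to_doa_deg`), the microphone spacing, and the speed of sound are illustrative assumptions and are not taken from the paper. `gcc_phat_delay` returns a single delay per frame, which is why one frame can represent only one speaker. `tf_delays` returns one delay per frequency bin from the inter-channel phase difference; it is a simplified stand-in for the idea of a per-slot DOA, not the TFDOA estimator actually used in the system.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def gcc_phat_delay(x1, x2, fs, max_tau):
    """Return one delay (seconds) of x2 relative to x1 for a whole frame."""
    n = len(x1) + len(x2)                # zero-pad to limit circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_tau)
    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(cc) - max_shift) / fs


def tf_delays(x1, x2, fs, max_tau):
    """Return (freqs, delays): one delay estimate per frequency bin of a frame."""
    X1 = np.fft.rfft(x1)
    X2 = np.fft.rfft(x2)
    freqs = np.fft.rfftfreq(len(x1), d=1.0 / fs)
    # Keep only bins whose phase difference cannot wrap for |delay| <= max_tau.
    valid = (freqs > 0) & (freqs < 1.0 / (2.0 * max_tau))
    phase = np.angle(X2[valid] * np.conj(X1[valid]))
    return freqs[valid], -phase / (2.0 * np.pi * freqs[valid])


def delay_to_doa_deg(tau, mic_distance):
    """Map a delay to a far-field DOA in degrees (0 degrees = broadside)."""
    s = np.clip(SPEED_OF_SOUND * np.asarray(tau) / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(s))
```

In use, `max_tau` would typically be set to `mic_distance / SPEED_OF_SOUND`, the largest delay physically possible for the array. The per-bin delays can then be mapped to angles with `delay_to_doa_deg` and grouped by direction, so that bins dominated by different speakers within the same frame can be assigned to different DOAs.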
