Paper Information - A DOA Based Speaker Diarization System for Real Meetings

This paper presents a speaker diarization system that estimates who spoke when in a meeting. The proposed system combines a noise-robust voice activity detector (VAD), a direction of arrival (DOA) estimator, and a DOA classifier. Our previous system used the generalized cross correlation method with the phase transform (GCC-PHAT) for DOA estimation. Because GCC-PHAT yields only one DOA estimate per frame, that system had difficulty handling overlapping speech. This paper addresses the issue by estimating a DOA at each time-frequency slot (TFDOA), and reports how this improves diarization performance for real meetings and conversations recorded in a room with a reverberation time of 350 ms.
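To make the GCC-PHAT limitation concrete, the following is a minimal sketch of GCC-PHAT delay estimation for one microphone pair and one frame. It is not the authors' implementation: the function names `gcc_phat_delay` and `delay_to_doa`, the far-field two-microphone geometry, the 4 cm spacing, and the 343 m/s speed of sound are all assumptions made for illustration. Note that the function returns a single delay (hence a single DOA) per frame, which is why overlapping speakers are hard to separate with this approach.

```python
import numpy as np


def gcc_phat_delay(x1, x2, fs, max_tau=None):
    """Estimate the delay (in seconds) of x2 relative to x1 using GCC-PHAT.

    A positive result means x2 lags behind x1. (Hypothetical helper.)
    """
    n = len(x1) + len(x2)  # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)

    # Cross-power spectrum with PHAT weighting: keep only the phase.
    cross = X2 * np.conj(X1)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that negative lags come first and lag 0 is at index max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs


def delay_to_doa(tau, mic_distance, speed_of_sound=343.0):
    """Convert a delay to a far-field DOA in degrees (assumed 2-mic geometry)."""
    ratio = np.clip(speed_of_sound * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))


# Usage: a white-noise frame and a copy delayed by 1 sample at 16 kHz.
fs = 16000
rng = np.random.default_rng(0)
frame = rng.standard_normal(4096)
delayed = np.concatenate((np.zeros(1), frame[:-1]))

tau = gcc_phat_delay(frame, delayed, fs, max_tau=0.001)
doa = delay_to_doa(tau, mic_distance=0.04)  # 4 cm spacing is an assumption
print(f"delay = {tau * fs:.0f} samples, DOA = {doa:.1f} degrees")
```

A per-time-frequency approach would instead derive a direction estimate for each frequency bin of each frame, so that bins dominated by different speakers can point to different directions within the same frame.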

S. Araki | M. Fujimoto | K. Ishizuka | H. Sawada | S. Makino
