Speaker localization for microphone array-based ASR: the effects of accuracy on overlapping speech

Accurate speaker localization is essential for optimal performance of distant speech acquisition systems that use microphone array techniques. However, to the best of our knowledge, there is no comprehensive study of how automatic speech recognition (ASR) performance degrades as a function of speaker location accuracy in a multi-party scenario. In this paper, we describe a framework for evaluating the effects of speaker location errors on a microphone array-based ASR system, in the context of meetings in multi-sensor rooms equipped with multiple cameras and microphones. Speakers are manually annotated in videos from different camera views, and triangulation is used to determine an accurate speaker location. Errors are then induced in the speaker location in a systematic manner to observe their influence on speech recognition performance. The system is evaluated on real overlapping speech data collected with simultaneous speakers in a meeting room. The results are compared with those obtained from close-talking headset microphones, from lapel microphones, and from speaker locations estimated using audio-only and audio-visual approaches.
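To illustrate what inducing location errors in a systematic manner could look like, the following is a minimal sketch, not the procedure used in the paper. It assumes a hypothetical 8-microphone circular array of 10 cm radius, a delay-and-sum style steering model, a speed of sound of 343 m/s, and errors applied as fixed-radius horizontal offsets in evenly spaced directions. All function names, the array geometry, and the speaker position are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def circular_array(radius, n_mics):
    """Hypothetical horizontal circular array centred at the origin."""
    return [(radius * math.cos(2 * math.pi * k / n_mics),
             radius * math.sin(2 * math.pi * k / n_mics),
             0.0) for k in range(n_mics)]


def steering_delays(source, mics):
    """Propagation delay (s) from source to each mic, relative to the first mic."""
    times = [math.dist(source, m) / SPEED_OF_SOUND for m in mics]
    return [t - times[0] for t in times]


def perturbed_locations(source, error_m, n_directions=8):
    """Shift the source by error_m metres in evenly spaced horizontal directions."""
    x, y, z = source
    return [(x + error_m * math.cos(2 * math.pi * k / n_directions),
             y + error_m * math.sin(2 * math.pi * k / n_directions),
             z) for k in range(n_directions)]


def max_delay_mismatch(source, mics, error_m):
    """Worst-case steering-delay error (s) over all perturbation directions and mics."""
    true_delays = steering_delays(source, mics)
    worst = 0.0
    for loc in perturbed_locations(source, error_m):
        est_delays = steering_delays(loc, mics)
        worst = max(worst, max(abs(a - b) for a, b in zip(true_delays, est_delays)))
    return worst


mics = circular_array(radius=0.1, n_mics=8)   # assumed geometry
speaker = (1.0, 0.5, 0.3)                     # assumed true location (m)
for err in (0.0, 0.05, 0.1, 0.2, 0.4):
    mismatch_us = max_delay_mismatch(speaker, mics, err) * 1e6
    print(f"location error {err:.2f} m -> max steering mismatch {mismatch_us:.1f} us")
```

In an evaluation of the kind described above, each perturbed location would be passed to the array processing front end and the recognizer's word error rate recorded per error magnitude; the delay mismatch printed here is only a proxy for how far the steering drifts from the true source.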
