Investigating the use of visual focus of attention for audio-visual speaker diarisation

Audio-visual speaker diarisation is the task of estimating ``who spoke when'' using audio and visual cues. In this paper we propose combining an audio diarisation system with psychology-inspired visual features, reporting experiments on multiparty meetings, a challenging domain characterised by unconstrained interaction and participant movements. More precisely, the role of gaze in coordinating speaker turns was exploited through Visual Focus of Attention (VFoA) features. Experiments were performed both with the reference VFoA annotations and with three automatic VFoA estimation systems of increasing complexity, based on head pose and visual activity cues. VFoA features yielded consistent speaker diarisation improvements when combined with audio features in a multi-stream approach.

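To make the multi-stream idea concrete, the following is a minimal sketch of a generic weighted log-likelihood fusion of an audio stream and a VFoA-based visual stream, followed by a simple Viterbi smoothing over speaker labels. This is an illustration under stated assumptions, not the paper's exact system: the function names (`fuse_streams`, `viterbi_decode`), the stream weight, and the turn-switch penalty are all hypothetical choices introduced here for clarity.

```python
# Minimal sketch of weighted multi-stream fusion for speaker diarisation.
# Assumes per-frame log-likelihoods have already been computed for each
# candidate speaker by (i) an acoustic model and (ii) a VFoA-based visual
# model; the variable names, the fixed stream weight, and the simple
# Viterbi smoothing below are illustrative, not the paper's exact method.
import numpy as np


def fuse_streams(audio_loglik, vfoa_loglik, stream_weight=0.9):
    """Log-linear combination of the two streams per frame and speaker.

    audio_loglik, vfoa_loglik: arrays of shape (n_frames, n_speakers)
    stream_weight: weight on the audio stream (1 - weight goes to VFoA).
    """
    return stream_weight * audio_loglik + (1.0 - stream_weight) * vfoa_loglik


def viterbi_decode(loglik, switch_penalty=5.0):
    """Smooth frame-level decisions with a penalty on speaker changes."""
    n_frames, n_speakers = loglik.shape
    delta = loglik[0].copy()
    backptr = np.zeros((n_frames, n_speakers), dtype=int)
    for t in range(1, n_frames):
        # Score of arriving at speaker j from speaker i: stay for free,
        # pay a penalty when switching speakers between frames.
        trans = delta[:, None] - switch_penalty * (1 - np.eye(n_speakers))
        backptr[t] = np.argmax(trans, axis=0)
        delta = trans[backptr[t], np.arange(n_speakers)] + loglik[t]
    # Backtrace the best speaker sequence.
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(delta))
    for t in range(n_frames - 2, -1, -1):
        path[t] = backptr[t + 1, path[t + 1]]
    return path  # speaker index per frame


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(100, 4))  # toy audio log-likelihoods
    vfoa = rng.normal(size=(100, 4))   # toy VFoA-based log-likelihoods
    labels = viterbi_decode(fuse_streams(audio, vfoa, stream_weight=0.9))
    print(labels[:20])
```

In such a scheme the stream weight controls how much the VFoA cues can override the acoustic evidence; the reported improvements suggest the visual stream carries complementary turn-taking information rather than replacing the audio.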