Speaker diarization using eye-gaze information in multi-party conversations

We present a novel speaker diarization method that uses eye-gaze information in multi-party conversations. In real environments, speaker diarization, or speech activity detection for each participant in the conversation, is challenging because of distant talking and ambient noise. In contrast, eye-gaze information is robust against acoustic degradation, and it is presumed that eye-gaze behavior plays an important role in turn-taking and thus in predicting utterances. The proposed method stochastically integrates eye-gaze information with acoustic information for speaker diarization. Specifically, three models for multi-modal integration are investigated in this paper. Experimental evaluations in real poster sessions demonstrate that the proposed method improves the accuracy of speaker diarization over the baseline acoustic method.

Index Terms: speaker diarization, multi-modal interaction, eye-gaze
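The abstract does not specify the three integration models, but the general idea of stochastically combining an acoustic speech posterior with a gaze-derived one can be illustrated with a simple log-linear fusion. The following is a minimal sketch, not the paper's actual method; the function name, the interpolation weight `alpha`, and the assumption that both streams yield per-frame speech probabilities are all illustrative.

```python
import numpy as np

def fuse_posteriors(p_acoustic, p_gaze, alpha=0.7, threshold=0.5):
    """Log-linear fusion of per-frame speech posteriors (illustrative sketch).

    p_acoustic, p_gaze : arrays of per-frame speech probabilities in (0, 1)
    alpha              : interpolation weight given to the acoustic stream
    Returns a boolean speech/non-speech decision for each frame.
    """
    eps = 1e-10
    p_a = np.clip(np.asarray(p_acoustic, dtype=float), eps, 1 - eps)
    p_g = np.clip(np.asarray(p_gaze, dtype=float), eps, 1 - eps)
    # Weighted geometric mean of the two streams for the "speech" hypothesis,
    # renormalized against the complementary "non-speech" hypothesis.
    num = p_a ** alpha * p_g ** (1 - alpha)
    den = num + (1 - p_a) ** alpha * (1 - p_g) ** (1 - alpha)
    p_fused = num / den
    return p_fused >= threshold
```

With `alpha` close to 1 the decision reduces to the acoustic baseline; lowering it lets the gaze stream override acoustically ambiguous frames, which is the intended benefit in noisy, distant-talking conditions.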
