Enhanced speaker diarization with detection of backchannels using eye-gaze information in poster conversations

We propose multi-modal speaker diarization using acoustic and eye-gaze information in poster conversations. Eye-gaze information plays an important role in turn-taking, and is therefore useful for predicting speech activity. In this paper, a variety of eye-gaze features are elaborated and combined with acoustic information in a multi-modal integration model. Moreover, we introduce another model to detect backchannels, which involve different eye-gaze behaviors; this enhances the diarization result by filtering out backchannels so that meaningful utterances such as questions and comments are retained. Experimental evaluations in real poster sessions demonstrate that eye-gaze information contributes to improvement of diarization accuracy under noisy environments, and that its weight is automatically determined according to the Signal-to-Noise Ratio (SNR).

Index Terms: speaker diarization, backchannel, multi-modal, eye-gaze, poster conversation
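To make the integration scheme concrete, below is a minimal sketch of how an SNR-dependent weighting of acoustic and eye-gaze streams could look. It assumes a simple linear interpolation of per-frame log-likelihood ratios and a hand-crafted mapping from SNR to the acoustic weight; the names (snr_weight, fuse_speech_activity) and the mapping itself are illustrative assumptions, not the integration model actually trained in the paper.

```python
import numpy as np

def snr_weight(snr_db, low=0.0, high=30.0):
    """Map SNR (dB) to an acoustic-stream weight in [0, 1].

    Illustrative linear mapping: at high SNR the acoustic stream dominates,
    at low SNR the eye-gaze stream is trusted more. The paper determines
    this weight automatically rather than by a fixed rule like this one.
    """
    return float(np.clip((snr_db - low) / (high - low), 0.0, 1.0))

def fuse_speech_activity(acoustic_llr, gaze_llr, snr_db, threshold=0.0):
    """Frame-wise speech-activity decision for one participant.

    acoustic_llr, gaze_llr: per-frame log-likelihood ratios (speech vs.
    non-speech) from the acoustic and eye-gaze models, respectively.
    Returns a boolean array of speech/non-speech decisions.
    """
    w = snr_weight(snr_db)  # weight on the acoustic stream
    fused = w * np.asarray(acoustic_llr) + (1.0 - w) * np.asarray(gaze_llr)
    return fused > threshold

# Usage example: under low SNR the decision leans toward the gaze stream.
acoustic = np.array([0.8, -0.2, 1.5, -1.0])
gaze     = np.array([0.1,  0.6, 0.4, -0.3])
print(fuse_speech_activity(acoustic, gaze, snr_db=5.0))
```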
