Vision-based Active Speaker Detection in Multiparty Interaction

This paper presents a supervised learning method for automatic visual detection of the active speaker in multiparty interactions. The detectors are built on a multimodal multiparty interaction dataset previously recorded to explore patterns in the focus of visual attention of humans. The dataset covers three conditions: two humans engaged in task-based interaction with a robot; the same two humans engaged in task-based interaction with the robot replaced by a third human; and a free three-party human interaction. The paper also evaluates the active speaker detection method in a speaker-dependent experiment, showing that it achieves good accuracy in a fairly unconstrained scenario using only image data as input. The main goal of the presented method is to provide real-time detection of the active speaker within a broader framework implemented on a robot and used to generate natural focus-of-visual-attention behavior during multiparty human-robot interactions.
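
To make the vision-only detection idea concrete, the sketch below shows one plausible shape of such a classifier: a small CNN over per-frame face crops feeding an LSTM over a short window, ending in a binary speaking/not-speaking decision per person. The abstract only states that supervised learning is applied to image data alone; the architecture, input size, and window length here are illustrative assumptions, not the paper's actual model.

```python
# Illustrative sketch only: a per-face "speaking / not speaking" classifier.
# Every architectural choice below (layer sizes, 64x64 crops, 10-frame window)
# is an assumption for demonstration; the paper's own model may differ.
import torch
import torch.nn as nn


class SpeakingClassifier(nn.Module):
    """CNN features per face crop, LSTM over a short frame window,
    binary speaking/not-speaking output for the last frame."""

    def __init__(self, hidden_size: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),                       # -> (batch * time, 32)
        )
        self.lstm = nn.LSTM(32, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)   # speaking vs. not speaking

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) -- a short window of face crops
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])            # logits for the final frame


# Usage: classify a 10-frame window of 64x64 face crops for one person.
model = SpeakingClassifier()
logits = model(torch.randn(1, 10, 3, 64, 64))
print(logits.argmax(dim=1))                     # 0 = silent, 1 = speaking
```

In a real-time multiparty setting, one such window would be maintained per detected face, and the classifier evaluated on each to decide who, if anyone, is currently speaking.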
