Paper information - Vision-based Active Speaker Detection in Multiparty Interaction

This paper presents a supervised learning method for automatic visual detection of the active speaker in multiparty interactions. The detectors are built using a multimodal multiparty interaction dataset that was previously recorded to explore patterns in the focus of visual attention of humans. The dataset covers three conditions: two humans involved in task-based interaction with a robot; the same two humans involved in task-based interaction where the robot is replaced by a third human; and a free three-party human interaction. The paper also evaluates the active speaker detection method in a speaker-dependent experiment, showing that the method achieves good accuracy rates in a fairly unconstrained scenario using only image data as input. The main goal of the method is to provide real-time detection of the active speaker within a broader framework that is implemented on a robot and used to generate natural focus-of-visual-attention behavior during multiparty human-robot interactions.
