Audiovisual data fusion for successive speakers tracking

In this paper, a method for tracking human speakers from audio and video data is presented, with application to tracking a conversation with a robot. Audiovisual data fusion is performed in a two-step process. First, detection is performed independently on each modality: face detection based on skin color on the video data, and sound source localization based on the time delay of arrival on the audio data. Second, the results of these detection processes are fused using an adaptation of the Bayesian filter to detect the speaker. The robot is thus able to detect the face of the person who is talking and to detect a new speaker in a conversation.
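
The two-step scheme can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a discrete (histogram) Bayes filter over azimuth bins, Gaussian likelihoods for both modalities, and a simple "stay or switch" transition model. The names `predict`, `update`, `estimate`, the noise parameters, and the angle grid are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical azimuth grid in degrees, in front of the robot.
ANGLES = list(range(-90, 91, 5))


def gaussian(x, mu, sigma):
    """Unnormalized Gaussian likelihood."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def normalize(belief):
    total = sum(belief)
    return [b / total for b in belief]


def predict(belief, stay_prob=0.8):
    """Prediction step: the speaker usually stays in place, but a new
    speaker may start talking from any direction (assumed model)."""
    uniform = (1.0 - stay_prob) / len(belief)
    return [stay_prob * b + uniform for b in belief]


def update(belief, audio_angle, face_angles,
           sigma_audio=15.0, sigma_face=5.0, face_floor=0.05):
    """Correction step: multiply the prior by the audio likelihood
    (sound source direction) and the video likelihood (detected faces)."""
    posterior = []
    for b, a in zip(belief, ANGLES):
        like_audio = gaussian(a, audio_angle, sigma_audio)
        # The floor keeps some probability mass where no face was detected.
        like_face = face_floor + max(
            (gaussian(a, f, sigma_face) for f in face_angles), default=0.0
        )
        posterior.append(b * like_audio * like_face)
    return normalize(posterior)


def estimate(belief):
    """Return the most probable speaker direction."""
    return ANGLES[belief.index(max(belief))]


# Two faces are visible at -30 and 40 degrees (toy values).
faces = [-30.0, 40.0]
belief = normalize([1.0] * len(ANGLES))

# The sound comes roughly from the right-hand face.
belief = update(predict(belief), audio_angle=35.0, face_angles=faces)
first = estimate(belief)

# The other person starts talking.
belief = update(predict(belief), audio_angle=-25.0, face_angles=faces)
second = estimate(belief)

print(first, second)
```

In this toy setup the coarse audio direction selects which detected face is speaking, and the uniform component of the prediction step lets the estimate move to a new speaker when the sound direction changes.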
