Multi-modal Fusion Framework with Particle Filter for Speaker Tracking

In the domain of Human-Computer Interaction (HCI), the main task of the computer is to interpret the external stimuli provided by users. Moreover, in multi-person scenarios it is important to localize and track the speaker. To address this problem, we introduce a framework by which multi-modal sensory data can be efficiently and meaningfully combined in the application of speaker tracking. This framework fuses four different observation types taken from multi-modal sensors. The advantages of this fusion are that weak sensory data from either modality can be reinforced and the influence of noise can be reduced. We propose a method of combining these modalities by employing a particle filter, which offers satisfactory real-time performance. We demonstrate results of speaker localization in two- and three-person scenarios.
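The fusion idea described above can be sketched with a minimal particle filter that multiplies per-modality likelihoods, so a weak cue in one modality is reinforced by the other. This is an illustrative sketch only: the 1-D azimuth state, the Gaussian observation models, the noise levels, and the function names are assumptions, not the paper's actual four observation types.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_likelihood(particles, measurement, sigma):
    """Likelihood of each particle under a simple Gaussian observation model."""
    return np.exp(-0.5 * ((particles - measurement) / sigma) ** 2)

def fuse_and_track(audio_meas, video_meas, n_particles=500):
    """Track a 1-D speaker azimuth (degrees) by fusing two noisy cue streams."""
    # Particles start uniformly over the field of view.
    particles = rng.uniform(-90.0, 90.0, n_particles)
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = []
    for a, v in zip(audio_meas, video_meas):
        # Predict: random-walk motion model.
        particles = particles + rng.normal(0.0, 2.0, n_particles)
        # Update: multiply the per-modality likelihoods (conditional
        # independence assumption), fusing audio and video evidence.
        weights = weights * gaussian_likelihood(particles, a, sigma=10.0)
        weights = weights * gaussian_likelihood(particles, v, sigma=5.0)
        weights = weights + 1e-300  # guard against total degeneracy
        weights = weights / weights.sum()
        estimates.append(float(np.sum(weights * particles)))
        # Systematic resampling when the effective sample size drops.
        if 1.0 / np.sum(weights ** 2) < n_particles / 2:
            positions = (rng.random() + np.arange(n_particles)) / n_particles
            idx = np.searchsorted(np.cumsum(weights), positions)
            idx = np.minimum(idx, n_particles - 1)
            particles = particles[idx]
            weights = np.full(n_particles, 1.0 / n_particles)
    return estimates

# Hypothetical scenario: speaker fixed at 30 degrees, with the audio bearing
# noisier than the video bearing.
audio = 30.0 + rng.normal(0.0, 8.0, 20)
video = 30.0 + rng.normal(0.0, 4.0, 20)
track = fuse_and_track(audio, video)
print(f"final estimate: {track[-1]:.1f} deg")
```

Because the two likelihoods are multiplied, a frame where one modality is uninformative (a broad likelihood) still yields a sharp posterior when the other modality is confident, which is the reinforcement property the abstract refers to.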
