Multi-modal Fusion Framework with Particle Filter for Speaker Tracking

In the domain of Human-Computer Interaction (HCI), the main task of the computer is to interpret the external stimuli provided by users. Moreover, in multi-person scenarios it is important to localize and track the speaker. To address this problem, we introduce a framework by which multi-modal sensory data can be efficiently and meaningfully combined in the application of speaker tracking. This framework fuses four different observation types taken from multi-modal sensors. The advantages of this fusion are that weak sensory data from either modality can be reinforced and the influence of noise can be reduced. We propose a method of combining these modalities by employing a particle filter, which offers satisfactory real-time performance. We demonstrate results of speaker localization in two- and three-person scenarios.
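The fusion idea described above can be sketched with a minimal particle filter that multiplies per-modality likelihoods, so a weak cue in one modality is reinforced by the other. This is an illustrative sketch only: the 1-D azimuth state, the Gaussian observation models, the noise levels, and the function names are assumptions, not the paper's actual four observation types.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_likelihood(particles, measurement, sigma):
    """Likelihood of each particle under a simple Gaussian observation model."""
    return np.exp(-0.5 * ((particles - measurement) / sigma) ** 2)

def fuse_and_track(audio_meas, video_meas, n_particles=500):
    """Track a 1-D speaker azimuth (degrees) by fusing two noisy cue streams."""
    # Particles start uniformly over the field of view.
    particles = rng.uniform(-90.0, 90.0, n_particles)
    weights = np.full(n_particles, 1.0 / n_particles)
    estimates = []
    for a, v in zip(audio_meas, video_meas):
        # Predict: random-walk motion model.
        particles = particles + rng.normal(0.0, 2.0, n_particles)
        # Update: multiply the per-modality likelihoods (conditional
        # independence assumption), fusing audio and video evidence.
        weights = weights * gaussian_likelihood(particles, a, sigma=10.0)
        weights = weights * gaussian_likelihood(particles, v, sigma=5.0)
        weights = weights + 1e-300  # guard against total degeneracy
        weights = weights / weights.sum()
        estimates.append(float(np.sum(weights * particles)))
        # Systematic resampling when the effective sample size drops.
        if 1.0 / np.sum(weights ** 2) < n_particles / 2:
            positions = (rng.random() + np.arange(n_particles)) / n_particles
            idx = np.searchsorted(np.cumsum(weights), positions)
            idx = np.minimum(idx, n_particles - 1)
            particles = particles[idx]
            weights = np.full(n_particles, 1.0 / n_particles)
    return estimates

# Hypothetical scenario: speaker fixed at 30 degrees, with the audio bearing
# noisier than the video bearing.
audio = 30.0 + rng.normal(0.0, 8.0, 20)
video = 30.0 + rng.normal(0.0, 4.0, 20)
track = fuse_and_track(audio, video)
print(f"final estimate: {track[-1]:.1f} deg")
```

Because the two likelihoods are multiplied, a frame where one modality is uninformative (a broad likelihood) still yields a sharp posterior when the other modality is confident, which is the reinforcement property the abstract refers to.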
