Speaker selection and tracking in a cluttered environment with audio and visual information

This paper presents a data association method that uses audio and visual data to localize targets in a cluttered environment and to detect who is speaking to a robot. A particle filter is applied to efficiently select the optimal association between targets and measurements. The state variables comprise the target positions and speaking states. To update the speaking state, the incoming sound signal is first evaluated by cross-correlation, and a likelihood is then computed from the audio information. The visual measurements are used to find the optimal association between targets and observed objects. The number of targets the robot should interact with is updated from the existence probabilities and the associations. To verify the performance of the proposed method on the speaker selection problem in a cluttered environment, experimental data were collected beforehand and replayed in computer simulation. The algorithm was also implemented on a robotic system to demonstrate reliable interaction between the robot and speaking targets.

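A minimal Python sketch of the audio step is given below, assuming a two-microphone array: the names `xcorr_peak` and `audio_likelihood`, the TDOA-to-bearing mapping, the Gaussian bearing term, and all parameter values (microphone spacing, `sigma_b`, and so on) are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s


def xcorr_peak(x, y, max_lag):
    """Peak of the normalized cross-correlation between two mic channels.

    Returns (peak value, lag in samples) with the convention
    c(lag) = sum_n x[n + lag] * y[n]. A strong peak indicates a
    coherent sound source; its lag encodes the source direction.
    """
    x = x - x.mean()
    y = y - y.mean()
    norm = np.sqrt(np.dot(x, x) * np.dot(y, y)) + 1e-12
    best_c, best_lag = -np.inf, 0
    for lag in range(-max_lag, max_lag + 1):
        c = np.dot(x[max(lag, 0):len(x) + min(lag, 0)],
                   y[max(-lag, 0):len(y) + min(-lag, 0)])
        if c > best_c:
            best_c, best_lag = c, lag
    return best_c / norm, best_lag


def audio_likelihood(bearing_hyp, speaking_hyp, peak_corr, peak_lag,
                     fs, mic_dist, sigma_b=0.15):
    """Illustrative audio likelihood for one particle (assumed model).

    The peak lag is mapped to a time difference of arrival and hence to
    a measured bearing. A particle hypothesizing "speaking" is rewarded
    when its bearing matches the measurement and the peak is strong; a
    "silent" particle is rewarded when the peak is weak.
    """
    tdoa = peak_lag / fs
    bearing_meas = np.arcsin(np.clip(tdoa * SPEED_OF_SOUND / mic_dist, -1.0, 1.0))
    speech_evidence = float(np.clip(peak_corr, 0.0, 1.0))
    if not speaking_hyp:
        return 1.0 - speech_evidence
    return speech_evidence * np.exp(-(bearing_hyp - bearing_meas) ** 2
                                    / (2.0 * sigma_b ** 2))


if __name__ == "__main__":
    fs, mic_dist, n = 16000, 0.2, 4096      # 16 kHz, 20 cm mic baseline
    rng = np.random.default_rng(0)
    src = rng.standard_normal(n)            # broadband source signal
    delay = 4                               # right channel lags by 4 samples
    left = src
    right = np.roll(src, delay) + 0.1 * rng.standard_normal(n)

    max_lag = int(round(mic_dist / SPEED_OF_SOUND * fs)) + 1  # physical limit
    corr, lag = xcorr_peak(left, right, max_lag)              # expect lag = -delay
    matched = np.arcsin(np.clip(lag / fs * SPEED_OF_SOUND / mic_dist, -1.0, 1.0))

    # Weight three hypothetical particles: bearing-matched speaker,
    # mismatched speaker, and a silent target at the matched bearing.
    for bearing, speaking in [(matched, True), (0.8, True), (matched, False)]:
        w = audio_likelihood(bearing, speaking, corr, lag, fs, mic_dist)
        print(f"bearing={bearing:+.2f} rad, speaking={speaking}: weight={w:.3f}")
```

In the full method, such a per-particle audio likelihood would be combined with the visual association likelihood to weight the joint position-and-speaking-state particles before resampling.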