Real-Time Auditory and Visual Multiple-Object Tracking for Humanoids

This paper presents real-time auditory and visual tracking of multiple objects for a humanoid operating in real-world environments. Real-time processing is crucial for sensorimotor tasks in tracking, and multiple-object tracking is crucial for real-world applications. Tracking multiple sound sources requires perceiving a mixture of sounds and cancelling the motor noise caused by body movements; however, real-time processing of this kind has not been reported so far. Real-time tracking is attained by fusing information obtained from sound source localization, multiple face recognition, speaker tracking, focus-of-attention control, and motor control. Auditory streams with sound source direction are extracted from 48 kHz sampled sound by an active audition system with motor-noise cancellation. Visual streams with face ID and 3D position are extracted from a single camera by combining skin-color extraction, correlation-based matching, and multiple-scale image generation. These auditory and visual streams are associated by comparing their spatial locations, and the associated streams are used to control the focus of attention. Auditory, visual, and association processing run asynchronously on separate PCs connected by a TCP/IP network. The resulting system, implemented on an upper-torso humanoid, can track multiple objects with a delay of 200 ms, imposed by visual tracking and network latency.
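
The association step pairs auditory streams (which carry only a sound source direction) with visual streams (which carry a face ID and a 3D position) by comparing spatial location. The abstract does not give the exact matching rule, so the following Python sketch only illustrates one plausible nearest-direction association; the AuditoryStream and VisualStream types and the 10-degree threshold are assumptions for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class AuditoryStream:
    azimuth_deg: float           # sound source direction from the active audition system (assumed)

@dataclass
class VisualStream:
    face_id: str                 # identity from multiple face recognition (assumed)
    position_xyz: tuple          # 3D position estimated from the single camera (assumed)

def associate(auditory, visual, max_angle_deg=10.0):
    """Pair each auditory stream with the visual stream whose horizontal
    direction is closest, if within max_angle_deg; otherwise leave it unpaired."""
    pairs = []
    for a in auditory:
        best, best_diff = None, max_angle_deg
        for v in visual:
            x, y, _ = v.position_xyz
            v_azimuth = math.degrees(math.atan2(y, x))      # face direction seen from the robot
            diff = abs((a.azimuth_deg - v_azimuth + 180) % 360 - 180)  # wrapped angular distance
            if diff <= best_diff:
                best, best_diff = v, diff
        pairs.append((a, best))
    return pairs
```

A real system would also have to handle temporal alignment of the asynchronous auditory and visual streams before such spatial matching; that part is omitted here.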
