A Modular Audio-Visual Scene Analysis and Attention System for Humanoid Robots

We present an audio-visual scene analysis system implemented and evaluated on the ARMAR-III robot head. The modular design allows fast integration of new algorithms and adaptation to new hardware. Further benefits are automatic module dependency checks and automatic determination of the execution order. The integrated world model manages the acquired data and serves it to all modules in a consistent way. The system achieves state-of-the-art performance in localizing, tracking, and classifying persons, as well as in exploring whole scenes and unknown objects. We use multimodal proto-objects to model and analyze salient stimuli in the robot's environment and thereby realize the robot's attention.
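The automatic dependency check and execution-order determination can be realized with a topological sort in the spirit of Kahn's algorithm. The following Python sketch illustrates the idea under the assumption that each module declares the names of the modules it depends on; the dependency representation and the module names are illustrative, not the system's actual API.

```python
from collections import deque

def resolve_execution_order(modules):
    """Determine a module execution order via Kahn's topological sort.

    `modules` maps a module name to the set of module names it depends
    on (a hypothetical representation). Raises ValueError if a declared
    dependency is unknown or the dependency graph contains a cycle.
    """
    # Dependency check: every declared dependency must be registered.
    for name, deps in modules.items():
        missing = deps - modules.keys()
        if missing:
            raise ValueError(f"module {name!r} depends on unknown modules {missing}")

    # in_degree[m] = number of not-yet-scheduled dependencies of m.
    in_degree = {name: len(deps) for name, deps in modules.items()}
    dependents = {name: [] for name in modules}
    for name, deps in modules.items():
        for dep in deps:
            dependents[dep].append(name)

    ready = deque(name for name, d in in_degree.items() if d == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for dependent in dependents[current]:
            in_degree[dependent] -= 1
            if in_degree[dependent] == 0:
                ready.append(dependent)

    if len(order) != len(modules):
        raise ValueError("cyclic dependency among modules")
    return order

# Example with hypothetical modules: a face identifier needs a face
# detector, which in turn needs camera images.
print(resolve_execution_order({
    "camera": set(),
    "face_detector": {"camera"},
    "face_identifier": {"face_detector"},
    "sound_localizer": set(),
}))
# -> ['camera', 'sound_localizer', 'face_detector', 'face_identifier']
#    (one valid order)
```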
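To make the proto-object idea concrete, the minimal sketch below models a proto-object as a scene region that carries one saliency score per modality, fuses the scores with a simple weighted sum, and selects the most salient proto-object as the next attention target. The field names and the weighted-sum fusion are assumptions for illustration, not the paper's exact model.

```python
from dataclasses import dataclass, field

@dataclass
class ProtoObject:
    """A pre-attentive scene region with per-modality saliency.

    Illustrative layout: position in the robot's world frame plus a
    mapping from modality name to a saliency score in [0, 1].
    """
    position: tuple
    saliency: dict = field(default_factory=dict)

def fused_saliency(obj, weights):
    # Weighted sum over modalities; modalities without a weight count as 0.
    return sum(weights.get(m, 0.0) * s for m, s in obj.saliency.items())

def next_attention_target(proto_objects, weights):
    # Direct the robot's attention to the currently most salient proto-object.
    return max(proto_objects, key=lambda o: fused_saliency(o, weights))

objects = [
    ProtoObject(position=(1.0, 0.2, 1.5),
                saliency={"visual": 0.4, "auditory": 0.9}),
    ProtoObject(position=(0.5, -0.3, 1.2),
                saliency={"visual": 0.7}),
]
target = next_attention_target(objects, weights={"visual": 0.5, "auditory": 0.5})
print(target.position)  # (1.0, 0.2, 1.5): the sounding object wins
```

The weights could be retuned per task, e.g. emphasizing the auditory channel when the robot listens for a speaker, which is one plausible way to bias attention across modalities.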
