Audio-Visual Perception System for a Humanoid Robotic Head

A central challenge in social robotics is endowing robots with the ability to direct attention to the people with whom they interact. Several approaches follow bio-inspired mechanisms, merging audio and visual cues from multiple sensors to localize a person. However, most of these fusion mechanisms have been deployed in fixed installations, such as video-conference rooms, and may run into difficulties when restricted to the sensors a robot can actually carry. Moreover, for interactive autonomous robots, the benefits of audio-visual attention mechanisms over audio-only or vision-only approaches have rarely been evaluated in real scenarios: most reported tests have been conducted in controlled environments, at short distances, and/or with off-line performance measurements. To demonstrate the benefit of fusing sensory information through Bayesian inference in interactive robotics, this paper presents a system that localizes a person by processing visual and audio data. The performance of this system is then evaluated against unimodal alternatives, taking their technical limitations into account. The experiments show the promise of the proposed approach for the proactive detection and tracking of speakers in a human-robot interaction framework.
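The abstract does not detail the fusion model itself. As a rough illustration of the kind of Bayesian audio-visual fusion it refers to, the sketch below combines a coarse audio direction-of-arrival (DOA) estimate with a more precise visual face bearing over a discretized azimuth grid. All names, noise parameters, and the Gaussian likelihood assumption are illustrative choices, not taken from the paper.

```python
import numpy as np

# Minimal sketch of Bayesian audio-visual fusion over azimuth.
# Grid resolution and sensor noise sigmas are illustrative
# assumptions, not values from the paper.

AZIMUTHS = np.linspace(-90.0, 90.0, 181)  # 1-degree grid, in degrees

def gaussian_likelihood(measured_deg, sigma_deg):
    """Likelihood of each candidate azimuth given one noisy measurement."""
    return np.exp(-0.5 * ((AZIMUTHS - measured_deg) / sigma_deg) ** 2)

def fuse(audio_doa_deg=None, face_bearing_deg=None,
         audio_sigma=15.0, vision_sigma=5.0):
    """Posterior over speaker azimuth from optional audio/visual cues.

    A missing cue simply leaves the posterior unchanged, so the same
    code covers audio-only, vision-only, and fused operation.
    """
    posterior = np.ones_like(AZIMUTHS)  # uniform prior over azimuth
    if audio_doa_deg is not None:       # coarse but omnidirectional cue
        posterior *= gaussian_likelihood(audio_doa_deg, audio_sigma)
    if face_bearing_deg is not None:    # precise but narrow field of view
        posterior *= gaussian_likelihood(face_bearing_deg, vision_sigma)
    posterior /= posterior.sum()        # normalize to a distribution
    return AZIMUTHS[np.argmax(posterior)], posterior

# Example: audio DOA reports ~20 deg, vision reports a face at ~12 deg.
estimate, _ = fuse(audio_doa_deg=20.0, face_bearing_deg=12.0)
print(f"Fused speaker azimuth estimate: {estimate:.1f} deg")
```

Because the visual cue is modeled as less noisy, the fused estimate is pulled toward the face bearing while the audio cue keeps the system responsive outside the camera's field of view, which matches the motivation for audio-visual fusion stated above.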
