Humanoid Early Visual Processing Using Attention Mechanisms

The basic, biologically inspired idea of the authors (shared by others in the scientific community) is to apply an attention-based filter to the large volume of visual input data and to perform further analysis only on the small remainder. We call this remainder the regions of interest (ROIs). Many approaches to computing salient features in a static image have been proposed: [1] shows that high-contrast regions seem to attract attention, and [2] reports that salient regions can be computed using multiscale images. Others argue that local complexity can serve as a measure of saliency [3]. A learning approach to visual saliency models has also been proposed recently [4]. Following these ideas, one foundation of our approach is the claim that fundamental attention attractors originating from sensory input can be either static salient features in a single frame or dynamics in the input data sequence (that is, temporal properties). Inspired by the idea in [5], which is claimed to be biologically plausible, we extend the saliency attention approach with the idea that vision is a process of active and sometimes even volitional exploration of the environment. Thus, considering the second assumption mentioned above, i.e. building on the theory of inhibition of return, which has been shown to be plausible in human visual psychophysics [6], we implement top-down cognitive feedback in the proposed system. Moreover, we go one step further and integrate a possibility not only for attention inhibition but also for directed attention guidance. This reinforcement is triggered by cognitive processes that reason about what relevant additional information can be gained from a specific region (see Section III). Although we do not claim to implement the entire framework of [5], e.g. inattentional blindness or change blindness, we do show that a system using its basic ideas performs considerably better than one without them.
Not contradicting, but complementing the work of other authors [7], [8], we do not want to focus solely on building a biologically plausible visual system; our primary goal is to apply the underlying ideas of such frameworks to a real-world robotic setup. We therefore avoid complex neural, connectionist, or machine learning techniques where possible and prefer discrete algorithms. These fast and efficient algorithms allow for real-time performance and high accuracy in manipulation tasks on standard hardware. The vision system presented in this paper is part of the JAST human-robot dialog system.
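
To make the interplay of bottom-up saliency, inhibition of return, and directed attention guidance more concrete, the following is a minimal sketch and not the implementation used in the system. It assumes that a saliency map is already available as a 2D grid, that top-down feedback can be modelled as a multiplicative bias map (values above 1 reinforce a location, values below 1 inhibit it), and that inhibition of return suppresses a square neighbourhood around each selected location. The function name `select_rois` and the parameters `bias`, `n_rois`, and `radius` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]


def select_rois(saliency: Grid, bias: Grid, n_rois: int,
                radius: int = 1) -> List[Tuple[int, int]]:
    """Pick up to n_rois attention targets from a biased saliency map.

    Assumption (not from the paper): top-down feedback is a per-cell
    multiplicative bias, and inhibition of return zeroes a square
    neighbourhood of the given radius around each selected cell.
    """
    rows, cols = len(saliency), len(saliency[0])
    # Combine bottom-up saliency with top-down feedback.
    priority = [[saliency[r][c] * bias[r][c] for c in range(cols)]
                for r in range(rows)]
    rois: List[Tuple[int, int]] = []
    for _ in range(n_rois):
        # Winner-take-all: the strongest remaining location wins.
        best_r, best_c = max(
            ((r, c) for r in range(rows) for c in range(cols)),
            key=lambda rc: priority[rc[0]][rc[1]],
        )
        if priority[best_r][best_c] <= 0.0:
            break  # nothing salient is left
        rois.append((best_r, best_c))
        # Inhibition of return: suppress the neighbourhood just attended.
        for r in range(max(0, best_r - radius), min(rows, best_r + radius + 1)):
            for c in range(max(0, best_c - radius), min(cols, best_c + radius + 1)):
                priority[r][c] = 0.0
    return rois
```

With a neutral bias map (all ones), the sketch simply visits saliency peaks in descending order. Raising the bias at one location moves it to the front of the visiting order, which corresponds to directed attention guidance in this simplified model.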

[1] M. Posner, et al. Components of visual orienting, 1984.

[2] Maurice Herlihy, et al. Wait-free synchronization, 1991, TOPL.

[3] J. Stroop. Studies of interference in serial verbal reactions, 1992.

[4] P. Reinagel, et al. Natural scene statistics at the centre of gaze, 1999, Network.

[5] Alois Knoll, et al. Integrating Language, Vision and Action for Human Robot Dialog Systems, 2007, HCI.

[6] A. Knoll, et al. A Novel Approach to Hand-Gesture Recognition in a Human-Robot Dialog System, 2008, 2008 First Workshops on Image Processing Theory, Tools and Applications.

[7] Michael Brady, et al. Saliency, Scale and Image Description, 2001, International Journal of Computer Vision.

[8] Christof Koch, et al. Attentional Selection for Object Recognition - A Gentle Way, 2002, Biologically Motivated Computer Vision.

[9] Message Passing Interface Forum. MPI: A message-passing interface standard, 1994.

[10] Philippas Tsigas, et al. Fast and lock-free concurrent priority queues for multi-thread systems, 2003, Proceedings International Parallel and Distributed Processing Symposium.

[11] Mark S. Gilzenrat, et al. A Systems-Level Perspective on Attention and Cognitive Control: Guided Activation, Adaptive Gating, Conflict Monitoring, and Exploitation versus Exploration, 2004.

[12] Liang-Tien Chia, et al. Detection of visual attention regions in images using robust subspace analysis, 2008, J. Vis. Commun. Image Represent.

[13] Alois Knoll, et al. Human-Robot dialogue for joint construction tasks, 2006, ICMI '06.

[14] Alois Knoll, et al. A Wait-free Realtime System for Optimal Distribution of Vision Tasks on Multicore Architectures, 2008, ICINCO-RA.

[15] A. Noë, et al. A sensorimotor account of vision and visual consciousness, 2001, The Behavioral and brain sciences.

[16] Ramin Zabih, et al. Comparing images using joint histograms, 1999, Multimedia Systems.

[17] C. Gilbert, et al. Perceptual learning and top-down influences in primary visual cortex, 2004, Nature Neuroscience.

[18] M. Posner, et al. Inhibition of return: Neural basis and function, 1985.

[19] Christof Koch, et al. A Model of Saliency-Based Visual Attention for Rapid Scene Analysis, 2009.

[20] Alois Knoll, et al. Integrating Multimodal Cues Using Grammar Based Models, 2007, HCI.

[21] Hugh Garraway. Parallel Computer Architecture: A Hardware/Software Approach, 1999, IEEE Concurrency.

[22] Bernhard Schölkopf, et al. A Nonparametric Approach to Bottom-Up Visual Saliency, 2006, NIPS.