Head Pose and Neural Network Based Gaze Direction Estimation for Joint Attention Modeling in Embodied Agents

Imitation is a powerful skill of infants, relevant for bootstrapping many cognitive capabilities such as communication, language, and learning under supervision. In infants, this skill relies on establishing a joint attentional link with the teaching party. In this work we propose a method for establishing joint attention between an experimenter and an embodied agent. The agent first estimates the head pose of the experimenter, based on tracking with a cylindrical head model. Two separate neural network regressors then interpolate the gaze direction and the target object depth from the computed head pose estimates. A bottom-up, feature-based saliency model is used to select and attend to objects in a restricted visual field indicated by the gaze direction. We demonstrate our system on a number of recordings in which the experimenter selects and attends to an object among several alternatives. Our results suggest that rapid gaze estimation can be achieved for establishing joint attention in interaction-driven robot training, which is a very promising testbed for hypotheses of cognitive development and the genesis of visual communication.
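The two-regressor stage described above can be sketched as follows. This is a minimal illustration using scikit-learn's `MLPRegressor` on synthetic data; the pose ranges, network sizes, and the synthetic pose-to-gaze relation are assumptions for demonstration, not the paper's actual training setup or data.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic training data (illustrative only): head pose as (yaw, pitch, roll)
# in degrees, as might be produced by a cylindrical-head-model tracker.
rng = np.random.default_rng(0)
poses = rng.uniform(-45.0, 45.0, size=(200, 3))

# Assumed relation: gaze direction roughly follows head yaw/pitch with noise,
# and target depth varies mildly with pitch. Both are stand-ins for real labels.
gaze = poses[:, :2] + rng.normal(0.0, 2.0, size=(200, 2))
depth = 1.0 + 0.01 * np.abs(poses[:, 1]) + rng.normal(0.0, 0.05, size=200)

# Two separate regressors, as in the paper: one maps head pose to gaze
# direction, the other maps head pose to target object depth.
gaze_net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                        random_state=0).fit(poses, gaze)
depth_net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                         random_state=0).fit(poses, depth)

# At run time, a tracked head pose yields a gaze estimate and a depth estimate,
# which together restrict the visual field handed to the saliency model.
test_pose = np.array([[10.0, -5.0, 0.0]])
gaze_pred = gaze_net.predict(test_pose)    # shape (1, 2): estimated (yaw, pitch)
depth_pred = depth_net.predict(test_pose)  # shape (1,): estimated depth
print(gaze_pred, depth_pred)
```

Splitting direction and depth into separate regressors mirrors the paper's design: the gaze direction constrains *where* the saliency model searches, while the depth estimate disambiguates *which* object along that line of sight is attended.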
