Where and Why are They Looking? Jointly Inferring Human Attention and Intentions in Complex Tasks

This paper addresses a new problem: jointly inferring human attention, intentions, and tasks from videos. Given an RGB-D video in which a human performs a task, we answer three questions simultaneously: 1) where the human is looking (attention prediction); 2) why the human is looking there (intention prediction); and 3) what task the human is performing (task recognition). We propose a hierarchical human-attention-object (HAO) model that represents tasks, intentions, and attention in a unified framework. A task is represented as a sequence of intentions that transition into one another, and each intention is composed of the human pose, attention, and objects. A beam search algorithm performs inference on the HAO graph to output the attention, intention, and task jointly. We also built a new video dataset of tasks, intentions, and attention, containing 14 task classes, 70 intention categories, 28 object classes, 809 videos, and approximately 330,000 frames. Experiments show that our approach outperforms existing methods.
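
To make the decoding step concrete, the sketch below shows beam-search decoding of an intention-label sequence, in the spirit of the inference described above. The HAO likelihood terms (pose, attention, objects) are abstracted into a single per-frame score matrix, and the function name, array shapes, and label-transition matrix are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def beam_search_intentions(frame_scores, transition, beam_width=5):
    """Beam-search decoding of an intention-label sequence (sketch).

    frame_scores: (T, K) log-likelihoods of K intention labels per frame
                  (standing in for the HAO pose/attention/object terms).
    transition:   (K, K) log transition probabilities between intentions.
    beam_width:   number of partial hypotheses kept at each step.
    """
    T, K = frame_scores.shape
    # A hypothesis is (cumulative log-score, label sequence so far).
    beams = sorted(
        ((frame_scores[0, k], [k]) for k in range(K)),
        key=lambda b: b[0], reverse=True,
    )[:beam_width]

    for t in range(1, T):
        candidates = []
        for score, labels in beams:
            for k in range(K):
                candidates.append(
                    (score + transition[labels[-1], k] + frame_scores[t, k],
                     labels + [k])
                )
        # Prune to the top-scoring hypotheses before the next frame.
        candidates.sort(key=lambda b: b[0], reverse=True)
        beams = candidates[:beam_width]

    best_score, best_labels = beams[0]
    return best_labels, best_score

# Toy usage: 100 frames, 4 intention labels, "sticky" transitions.
rng = np.random.default_rng(0)
scores = np.log(rng.dirichlet(np.ones(4), size=100))
trans = np.log(0.05 + 0.80 * np.eye(4))
labels, score = beam_search_intentions(scores, trans)
```

One plausible way to extend this to task recognition, consistent with the abstract's description, is to give each task its own intention-transition matrix, decode the video under each task's matrix, and report the task whose best sequence scores highest; the single generic `transition` above is a simplification of that.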
