Learning to infer human attention in daily activities

Abstract The first attention model in the computer science community was proposed in 1998, and human attention has been studied intensively in the years since. However, these studies mainly refer to human attention as the image regions that draw the attention of a human (outside the image) who is looking at the image. In this paper, we infer the attention of a human inside a third-person-view video in which the human is performing a task, and we define human attention as the attentional objects that coincide with the task the human is performing. To infer human attention, we propose a deep neural network model that fuses a low-level human pose cue with a high-level task encoding cue. Due to the lack of appropriate public datasets for studying this problem, we collect a new video dataset in complex Virtual-Reality (VR) scenes. In the experiments, we extensively compare our method with three other methods on this VR dataset. In addition, we re-annotate a public real-world dataset and conduct additional experiments on it. The experimental results validate the effectiveness of our method.
