Learning to Predict Gaze in Egocentric Video

We present a model for gaze prediction in egocentric video that leverages the implicit cues present in the camera wearer's behavior. Specifically, we compute the camera wearer's head motion and hand location from the video and combine them to estimate where the eyes look. We further model the dynamic behavior of the gaze, in particular fixations, as latent variables to improve gaze prediction. Our gaze prediction results outperform state-of-the-art algorithms by a large margin on publicly available egocentric vision datasets. In addition, we demonstrate a significant performance boost in recognizing daily actions and segmenting foreground objects when our gaze predictions are plugged into state-of-the-art methods.
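The core idea of combining head motion and hand location to regress a gaze point can be sketched as follows. This is a minimal illustration under assumed per-frame features (head-motion displacement and hand position) and a simple least-squares linear model; the paper's actual model, including the latent fixation variables, is more elaborate.

```python
import numpy as np

# Illustrative sketch only: each frame yields a feature vector
# [head_dx, head_dy, hand_x, hand_y], and a linear map W predicts the
# 2-D gaze point. Feature names and the linear model are assumptions,
# not the paper's implementation.

rng = np.random.default_rng(0)

# Synthetic training data: 200 frames of 4-D behavioral features.
X = rng.uniform(0.0, 1.0, size=(200, 4))
true_W = np.array([[0.3, 0.0],   # head motion contributes weakly
                   [0.0, 0.3],
                   [0.6, 0.1],   # hand location contributes strongly
                   [0.1, 0.6]])
Y = X @ true_W + 0.01 * rng.normal(size=(200, 2))  # noisy gaze targets

# Least-squares fit: W = argmin_W ||X W - Y||^2.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Predict the gaze point for a new frame's features.
frame = np.array([0.1, -0.05, 0.5, 0.4])
gaze_xy = frame @ W
```

In this toy setup the fit recovers weights that give hand location more influence than head motion, mirroring the intuition that the hands indicate the locus of manipulation in daily activities.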
