Can computers learn from humans to see better?: inferring scene semantics from viewers' eye movements

This paper describes an attempt to bridge the semantic gap between computer vision and scene understanding by employing eye movements. While computer vision algorithms can efficiently detect scene objects, discovering the semantic relationships between these objects is equally essential for scene understanding. Humans understand complex scenes by rapidly moving their eyes (saccades) to selectively focus on salient entities (fixations). For 110 social scenes, we compared verbal descriptions provided by observers against eye movements recorded during a free-viewing task. Data analysis confirms (i) a strong correlation between task-explicit linguistic descriptions and task-implicit eye movements, both of which are influenced by the underlying scene semantics, and (ii) the ability of eye movements, in the form of fixations and saccades, to indicate salient entities and the entity relationships mentioned in scene descriptions. We demonstrate how eye movements are useful for inferring the meaning of social scenes (everyday scenes depicting human activities) and affective scenes (emotion-evoking content such as expressive faces and nudes). While saliency has traditionally been studied through the prism of fixations, we show that saccades are particularly useful for (i) distinguishing mild from high-intensity facial expressions and (ii) discovering interactive actions between scene entities.
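As a rough illustration of how fixations and saccades could be mapped onto scene entities and entity relationships, the sketch below is hypothetical code, not the paper's implementation: the entity boxes, coordinates, and names are assumptions. It counts fixations falling inside annotated entity regions as a salience proxy, and counts saccades whose endpoints land on two different entities as an interaction proxy.

```python
# Hypothetical sketch (not the paper's actual pipeline): estimate entity salience
# from fixation counts and entity-pair interactions from saccades that cross
# between two entities' regions of interest. All boxes and points are illustrative.

from collections import Counter

# Entity bounding boxes as (x_min, y_min, x_max, y_max), keyed by entity label.
entities = {
    "person_A": (100, 80, 220, 400),
    "person_B": (300, 90, 420, 410),
    "dog":      (200, 350, 320, 450),
}

# One viewer's fixation sequence: (x, y) centroids in image coordinates,
# in temporal order; consecutive fixations are separated by saccades.
fixations = [(150, 200), (350, 210), (160, 190), (260, 400), (340, 220)]

def entity_at(point, entities):
    """Return the label of the entity whose box contains the point, else None."""
    x, y = point
    for label, (x0, y0, x1, y1) in entities.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return label
    return None

# (a) Salience proxy: how often each entity is fixated.
fixation_counts = Counter(
    lbl for lbl in (entity_at(f, entities) for f in fixations) if lbl is not None
)

# (b) Interaction proxy: saccades whose endpoints fall on two different entities.
saccade_pairs = Counter()
for start, end in zip(fixations, fixations[1:]):
    a, b = entity_at(start, entities), entity_at(end, entities)
    if a and b and a != b:
        saccade_pairs[tuple(sorted((a, b)))] += 1

print("Fixated entities:", fixation_counts)
print("Entity pairs linked by saccades:", saccade_pairs)
```

In practice such counts would presumably be aggregated over viewers and normalized before being compared against the entities and relations mentioned in the verbal scene descriptions.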
