Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition

Systems based on bag-of-words models, operating on image features collected at the maxima of sparse interest point operators, have been extremely successful for computer-based visual object and action recognition. While the sparse, interest-point-based approach to recognition is not inconsistent with visual processing in biological systems that operate in "saccade and fixate" regimes, the knowledge, methodology, and emphasis of the human and computer vision communities remain sharply distinct. Here, we make three contributions aimed at bridging this gap. First, we complement existing state-of-the-art large-scale dynamic computer vision datasets, such as Hollywood-2 [19] and UCF Sports [15], with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge, these are the first large-scale human eye-tracking datasets collected for video (497,107 frames, each viewed by 16 subjects), unique in terms of their (a) scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control, as opposed to free viewing. Second, we introduce novel dynamic consistency and alignment models, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the large amounts of collected data to pursue studies and to build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point sampling strategies and human fixations, and on their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system leveraging some of the most advanced computer vision practice, can lead to state-of-the-art results.
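The pipeline the abstract outlines — predict a saliency (fixation) map, bias interest-point/descriptor sampling toward salient locations, then classify a bag-of-words histogram — can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: the toy saliency map, random "descriptors", and the function names `sample_salient_points` and `bag_of_words` are all assumptions; real systems use HoG/HoF descriptors extracted from video rather than random vectors.

```python
import numpy as np

def sample_salient_points(saliency, n_points, rng):
    """Draw interest-point locations with probability proportional to a
    saliency (fixation) map -- a stand-in for biasing descriptor sampling
    toward regions humans fixate."""
    probs = saliency.ravel() / saliency.sum()
    idx = rng.choice(saliency.size, size=n_points, replace=True, p=probs)
    ys, xs = np.unravel_index(idx, saliency.shape)
    return np.stack([ys, xs], axis=1)

def bag_of_words(descriptors, codebook):
    """Quantize descriptors to their nearest codeword (Euclidean distance)
    and return an L1-normalized visual-word histogram."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)

# Toy saliency map: dim background with a bright 16x16 "fixated" block.
saliency = np.full((64, 64), 0.02)
saliency[24:40, 24:40] = 1.0

pts = sample_salient_points(saliency, 500, rng)
frac_central = np.mean((pts[:, 0] >= 24) & (pts[:, 0] < 40) &
                       (pts[:, 1] >= 24) & (pts[:, 1] < 40))
print(frac_central > 0.5)  # True: sampling concentrates on the salient block

# Random "descriptors" stand in for local features at the sampled points.
descriptors = rng.normal(size=(500, 8))
codebook = rng.normal(size=(20, 8))
h = bag_of_words(descriptors, codebook)
print(round(h.sum(), 6))  # 1.0 -- normalized histogram, ready for a classifier
```

In a real system the saliency map would come from a learned fixation predictor and the codebook from clustering training descriptors; the resulting histogram would typically feed a χ² kernel SVM, for which [32] gives efficient approximations.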

[1]  A. L. Yarbus. Eye Movements and Vision, 1967.

[2]  A. L. Yarbus. Eye Movements and Vision. Springer US, 1967.

[3]  W. T. Freeman et al. Presented at: 2nd Annual IEEE International Conference on Image, 1995.

[4]  R. Rosenholtz. A simple saliency model predicts a number of motion popout phenomena. Vision Research, 1999.

[5]  C. Koch et al. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research, 2000.

[6]  P. A. Viola et al. Robust Real-time Object Detection, 2001.

[7]  A. Torralba et al. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision, 2001.

[8]  P. A. Viola et al. Robust Real-Time Face Detection. International Journal of Computer Vision, 2001.

[9]  J. K. Tsotsos et al. Neurobiology of Attention, 2005.

[10]  I. Laptev. On Space-Time Interest Points. International Journal of Computer Vision, 2005.

[11]  A. Torralba et al. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. Psychological Review, 2006.

[12]  B. Schölkopf et al. How to Find Interesting Locations in Video: A Spatiotemporal Interest Point Detector Learned from Human Eye Movements. DAGM-Symposium, 2007.

[13]  T. Serre et al. A Biologically Inspired System for Action Recognition. IEEE International Conference on Computer Vision, 2007.

[14]  P. Perona et al. What do we perceive in a glance of a real-world scene? Journal of Vision, 2007.

[15]  M. Shah et al. Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[16]  D. A. McAllester et al. A discriminatively trained, multiscale, deformable part model. IEEE Conference on Computer Vision and Pattern Recognition, 2008.

[17]  Z. Liu et al. Expandable Data-Driven Graphical Modeling of Human Actions Based on Salient Postures. IEEE Transactions on Circuits and Systems for Video Technology, 2008.

[18]  L. Van Gool et al. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 2010.

[19]  C. Schmid et al. Actions in context. IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[20]  F. Durand et al. Learning to predict where humans look. IEEE International Conference on Computer Vision, 2009.

[21]  D. Han et al. Selection and context for action recognition. IEEE International Conference on Computer Vision, 2009.

[22]  K. A. Ehinger et al. Modelling search for people in 900 scenes: A combined source model of eye guidance, 2009.

[23]  G. E. Hinton et al. Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS, 2010.

[24]  L. Itti et al. A Bayesian model for efficient visual search and recognition. Vision Research, 2010.

[25]  F.-F. Li et al. Modeling mutual context of object and human pose in human-object interaction activities. IEEE Conference on Computer Vision and Pattern Recognition, 2010.

[26]  C. Schmid et al. Action recognition by dense trajectories. IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[27]  A. Torralba et al. Fixations on low-resolution images. Journal of Vision, 2010.

[28]  A. D. Hwang et al. Semantic guidance of eye movements in real-world scenes. Vision Research, 2011.

[29]  A. Borji et al. Scene classification with a sparse set of salient regions. IEEE International Conference on Robotics and Automation, 2011.

[30]  C. Schmid et al. Weakly Supervised Learning of Interactions between Humans and Objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

[31]  M. Dorr et al. Space-Variant Descriptor Sampling for Action Recognition Based on Saliency and Eye Movements. ECCV, 2012.

[32]  C. Sminchisescu et al. Chebyshev approximations to the histogram χ² kernel. IEEE Conference on Computer Vision and Pattern Recognition, 2012.
