Saccade gaze prediction using a recurrent neural network

We present a model that generates human-like gaze sequences for a given image in a free-viewing task. The approach builds on recent advances in image recognition with convolutional neural networks and in sequence modeling with recurrent neural networks. Convolutional feature maps are used as input to a recurrent neural network, which acts as a visual working memory: it integrates scene information and outputs a sequence of saccades. The model is trained end-to-end on real-world human eye-tracking data using backpropagation and adaptive stochastic gradient descent. Overall, the proposed model is simpler than state-of-the-art methods while performing favorably on a standard eye-tracking data set.
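The data flow described in the abstract (convolutional features in, recurrent state update, sequence of gaze positions out) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the sketch assumes a plain tanh recurrent cell, global average pooling of the feature map, a start at the image centre, and random untrained weights. All names (`make_matrix`, `affine`, `predict_scanpath`) are hypothetical, and the feature map is random data standing in for the output of a pretrained network.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random (untrained) weight matrix stored as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def affine(vec, matrix):
    """Compute vec @ matrix for a list-of-rows matrix."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


def predict_scanpath(feature_map, w_in, w_rec, w_out, n_fixations=5):
    """Roll a tanh recurrent cell over pooled features and emit fixations.

    feature_map: H x W x C nested lists, assumed to come from a CNN.
    Returns n_fixations (x, y) pairs, each coordinate in (0, 1).
    """
    cells = [cell for row in feature_map for cell in row]
    # Global average pooling over spatial positions: one value per channel.
    pooled = [sum(channel) / len(cells) for channel in zip(*cells)]
    hidden = [0.0] * len(w_rec)   # recurrent state ("working memory")
    fixation = [0.5, 0.5]         # assumed start: image centre
    path = []
    for _ in range(n_fixations):
        # Input at each step: image features plus the previous fixation.
        x = pooled + fixation
        pre = [a + b for a, b in zip(affine(x, w_in), affine(hidden, w_rec))]
        hidden = [math.tanh(p) for p in pre]
        # Sigmoid squashes the output to normalized image coordinates.
        fixation = [1.0 / (1.0 + math.exp(-z)) for z in affine(hidden, w_out)]
        path.append(tuple(fixation))
    return path


rng = random.Random(0)
CHANNELS, HIDDEN = 8, 16  # illustrative sizes, not taken from the paper
feature_map = [[[rng.gauss(0.0, 1.0) for _ in range(CHANNELS)]
                for _ in range(4)] for _ in range(4)]
w_in = make_matrix(CHANNELS + 2, HIDDEN, rng)
w_rec = make_matrix(HIDDEN, HIDDEN, rng)
w_out = make_matrix(HIDDEN, 2, rng)

scanpath = predict_scanpath(feature_map, w_in, w_rec, w_out, n_fixations=5)
for x, y in scanpath:
    print(f"fixation at ({x:.3f}, {y:.3f})")
```

With untrained weights the printed fixations carry no meaning. In an end-to-end setup such as the one the abstract describes, the weights would instead be fitted to recorded human scanpaths by backpropagating a loss through the recurrent steps.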
