Deep Spatio-Temporal Modeling for Object-Level Gaze-Based Relevance Assessment

This work investigates the problem of object-level relevance assessment, taking into account the user's captured gaze signal (behaviour) and following the Deep Learning (DL) paradigm. Human gaze, as a sub-conscious response, is influenced by several factors related to human mental activity. Several studies have proposed methodologies based on statistical gaze modeling and naive classifiers for assessing images or image patches as relevant or not to the user's interests. Nevertheless, the overwhelming majority of literature approaches have so far relied on handcrafted features and relatively simple classification schemes. In contrast, the current work focuses on DL schemes that enable the modeling of complex patterns in the captured gaze signal and the subsequent derivation of corresponding discriminant features. Novel contributions of this study include: a) the introduction of a large-scale annotated gaze dataset, suitable for training DL models; b) a novel method for gaze modeling, capable of handling gaze sensor errors; and c) a DL-based method able to capture gaze patterns for assessing image objects as relevant or non-relevant with respect to the user's preferences. Extensive experiments demonstrate the efficiency of the proposed method, also taking into consideration key factors related to human gaze behaviour.
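To make contributions b) and c) more concrete, the sketch below illustrates one plausible preprocessing pipeline implied by the abstract: discarding gaze samples affected by sensor errors (tracker dropouts, off-screen coordinates) and grouping the remaining samples into per-object gaze sequences, which a sequence model could then classify as relevant or not. All function and variable names here are hypothetical; the paper's actual method is not specified in the abstract.

```python
# Illustrative sketch only -- function names, the dropout convention (None),
# and axis-aligned object boxes are assumptions, not the paper's actual design.

def clean_gaze(samples, width, height):
    """Drop sensor-error samples: tracker dropouts (None) and
    coordinates that fall outside the screen bounds."""
    cleaned = []
    for s in samples:
        if s is None:                      # tracker dropout
            continue
        x, y = s
        if 0 <= x < width and 0 <= y < height:
            cleaned.append((x, y))
    return cleaned

def object_gaze_sequences(samples, boxes):
    """Assign each cleaned gaze point to every object whose bounding
    box (x0, y0, x1, y1) contains it, preserving temporal order.
    The resulting per-object sequences would feed a spatio-temporal
    classifier (e.g. a recurrent network)."""
    seqs = {name: [] for name in boxes}
    for x, y in samples:
        for name, (x0, y0, x1, y1) in boxes.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                seqs[name].append((x, y))
    return seqs

# Example usage with a toy 100x100 screen and two objects:
raw = [(10, 10), None, (-5, 3), (50, 40), (120, 40)]
pts = clean_gaze(raw, 100, 100)            # [(10, 10), (50, 40)]
seqs = object_gaze_sequences(pts, {"cup": (0, 0, 20, 20),
                                   "book": (40, 30, 60, 50)})
```

The per-object sequences keep the temporal ordering of the gaze signal, which is what distinguishes a spatio-temporal DL approach from the fixation-count statistics used by the handcrafted-feature methods the abstract contrasts against.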
