Predicting the Category and Attributes of Mental Pictures Using Deep Gaze Pooling

Previous work focused on predicting visual search targets from human fixations but, in the real world, a specific target is often not known, e.g. when searching for a present for a friend. In this work we instead study the problem of predicting the mental picture, i.e. only an abstract idea of the target rather than a specific instance. This task is significantly more challenging: mental pictures of the same target category can vary widely depending on personal biases, and characteristic target attributes often cannot be verbalised explicitly. We propose to use gaze as implicit information on the user's mental picture and present a novel gaze pooling layer that seamlessly integrates semantic and localised fixation information into a deep image representation. We show that we can robustly predict both the mental picture's category and its attributes on a novel dataset containing fixation data of 14 users searching for targets on a subset of the DeepFashion dataset. Our results have important implications for future search interfaces and suggest deep gaze pooling as a general-purpose approach for gaze-supported computer vision systems.
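The core idea of gaze pooling can be sketched as weighting the spatial cells of a convolutional feature map by a fixation density map before pooling. This is a minimal illustrative sketch, not the paper's exact layer: the function name `gaze_pooling` and the assumption that the fixation density map has already been downsampled and aligned to the feature grid are ours.

```python
import numpy as np

def gaze_pooling(feature_map, fixation_map, eps=1e-8):
    """Pool a CNN feature map using a fixation density map.

    feature_map:  (C, H, W) activations from a convolutional layer.
    fixation_map: (H, W) non-negative fixation density (e.g. from
                  Gaussian-blurred fixation points), assumed to be
                  resampled to the same H x W grid as the features.
    Returns a (C,) gaze-weighted image descriptor.
    """
    # Normalise the fixation map into a spatial probability distribution,
    # then take the expectation of the feature vectors under it.
    w = fixation_map / (fixation_map.sum() + eps)
    return (feature_map * w[None, :, :]).sum(axis=(1, 2))

# Toy example: 4 channels on a 3x3 grid, all fixation mass on the centre
# cell, so the descriptor reduces to the centre cell's feature vector.
feats = np.arange(4 * 9, dtype=float).reshape(4, 3, 3)
fix = np.zeros((3, 3))
fix[1, 1] = 1.0
desc = gaze_pooling(feats, fix)
```

Because the weighting is a simple linear operation, such a layer slots into a standard CNN without retraining the convolutional backbone; the pooled descriptor can then feed category and attribute classifiers.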
