Gaze Embeddings for Zero-Shot Image Classification

Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts. We instead propose a method that relies on human gaze as auxiliary information, exploiting the fact that even non-expert users have a natural ability to judge class membership. We present a data collection paradigm that involves a discrimination task to increase the information content obtained from gaze data. Our method extracts discriminative descriptors from the data and learns a compatibility function between image and gaze using three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG) and Gaze Features with Sequence (GFS). We introduce two new gaze-annotated datasets for fine-grained image classification and show that human gaze data is indeed class discriminative, provides a competitive alternative to expert-annotated attributes, and outperforms other baselines for zero-shot image classification.
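To make the idea of a gaze embedding and an image-gaze compatibility function more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify how Gaze Histograms are built or how the compatibility function is parameterized, so everything here is an assumption: the spatial grid binning, the per-class averaging, the bilinear form, and all names (gaze_histogram, class_embedding, compatibility, predict). The matrix W would normally be learned from seen classes; here it is simply passed in.

```python
def gaze_histogram(gaze_points, width, height, rows=3, cols=3):
    """Count gaze points per cell of a rows x cols grid, then L1-normalize.

    Assumption: gaze_points are (x, y) pixel coordinates on the image.
    """
    counts = [0.0] * (rows * cols)
    for x, y in gaze_points:
        if not (0 <= x <= width and 0 <= y <= height):
            continue  # ignore samples that fall outside the image
        r = min(int(y / height * rows), rows - 1)
        c = min(int(x / width * cols), cols - 1)
        counts[r * cols + c] += 1.0
    total = sum(counts)
    return [v / total for v in counts] if total > 0 else counts


def class_embedding(histograms):
    """Average per-image histograms into one vector per class (assumed)."""
    n = len(histograms)
    return [sum(column) / n for column in zip(*histograms)]


def compatibility(image_feat, W, class_emb):
    """Bilinear score image_feat^T W class_emb (assumed form)."""
    return sum(
        image_feat[i] * W[i][j] * class_emb[j]
        for i in range(len(image_feat))
        for j in range(len(class_emb))
    )


def predict(image_feat, W, class_embs):
    """Return the class name whose gaze embedding scores highest."""
    scores = {
        name: compatibility(image_feat, W, emb)
        for name, emb in class_embs.items()
    }
    return max(scores, key=scores.get)
```

In a zero-shot setting under these assumptions, class_embs would hold gaze embeddings only for classes that contributed no labeled training images, and prediction reduces to picking the class with the highest compatibility score.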
