Paper Information - Gaze Embeddings for Zero-Shot Image Classification

Gaze Embeddings for Zero-Shot Image Classification

Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts. We instead propose a method that relies on human gaze as auxiliary information, exploiting that even non-expert users have a natural ability to judge class membership. We present a data collection paradigm that involves a discrimination task to increase the information content obtained from gaze data. Our method extracts discriminative descriptors from the data and learns a compatibility function between image and gaze using three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG) and Gaze Features with Sequence (GFS). We introduce two new gaze-annotated datasets for fine-grained image classification and show that human gaze data is indeed class discriminative, provides a competitive alternative to expert-annotated attributes, and outperforms other baselines for zero-shot image classification.
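To make the idea of a compatibility function between image and gaze concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify how the embeddings are built, so the code assumes a Gaze Histogram is a normalized spatial grid histogram of gaze points, that a class embedding is the average histogram over observers, and that compatibility is bilinear, F(x, y) = θ(x)ᵀ W φ(y). The function names, the grid size, and the matrix `W` (which would normally be learned from seen classes) are all hypothetical.

```python
import numpy as np


def gaze_histogram(points, grid=(3, 3)):
    """Normalized 2D histogram of gaze points.

    Assumption: points are (x, y) pairs in [0, 1] x [0, 1] image coordinates.
    """
    pts = np.asarray(points, dtype=float)
    hist, _, _ = np.histogram2d(
        pts[:, 0], pts[:, 1], bins=grid, range=[[0, 1], [0, 1]]
    )
    total = hist.sum()
    # Flatten to a vector; avoid dividing by zero when no point falls in range.
    return (hist / total).ravel() if total > 0 else hist.ravel()


def class_embedding(gaze_tracks, grid=(3, 3)):
    """Average the per-observer histograms of one class (assumed aggregation)."""
    return np.mean([gaze_histogram(t, grid) for t in gaze_tracks], axis=0)


def compatibility(image_feat, class_emb, W):
    """Bilinear compatibility score theta(x)^T W phi(y)."""
    return float(image_feat @ W @ class_emb)


def predict(image_feat, class_embs, W):
    """Zero-shot prediction: pick the class whose embedding scores highest."""
    scores = {
        name: compatibility(image_feat, emb, W)
        for name, emb in class_embs.items()
    }
    return max(scores, key=scores.get)
```

At test time, `predict` only needs gaze embeddings for the unseen classes, which is what lets gaze stand in for expert-annotated attributes as auxiliary information in this sketch.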
