Gaze latent support vector machine for image classification

This paper addresses image categorization under weak supervision, e.g. global image labels. We propose to improve the region selection performed in latent variable models such as the Latent Support Vector Machine (LSVM) by leveraging human eye movement features collected with an eye-tracking device. We introduce a new model, the Gaze Latent Support Vector Machine (G-LSVM), whose region selection during training is biased toward regions with a large gaze density ratio. To this end, the training objective is enriched with a gaze loss, from which we derive a convex upper bound, leading to a Concave-Convex Procedure (CCCP) optimization scheme. Experiments show that G-LSVM significantly outperforms LSVM on both object detection and action recognition problems on PASCAL VOC 2012. We also show that G-LSVM is even slightly better than a model trained from bounding box annotations, while gaze labels are much cheaper to collect.
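
The idea of biasing latent region selection toward gazed regions can be illustrated with a minimal sketch. It is not the paper's formulation: the loss `1 - gaze_ratio[h]`, the trade-off weight `gamma`, and all function names are assumptions made for illustration, and the gaze density ratio of each region is assumed to be normalized to [0, 1].

```python
import numpy as np

def lsvm_select(w, feats):
    """LSVM-style inference: the latent region is the one with the highest linear score."""
    scores = feats @ w  # one score per candidate region (one row of feats each)
    h = int(np.argmax(scores))
    return h, float(scores[h])

def gaze_loss(gaze_ratio, h):
    """Hypothetical gaze loss: low when region h has a high gaze density ratio."""
    return 1.0 - float(gaze_ratio[h])

def gaze_biased_select(w, feats, gaze_ratio, gamma=1.0):
    """Training-time selection (assumption): trade off the region score against the gaze loss."""
    scores = feats @ w
    penalized = scores - gamma * (1.0 - np.asarray(gaze_ratio, dtype=float))
    return int(np.argmax(penalized))

# Toy example: three candidate regions with 2-D features.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
w = np.array([1.0, 0.2])
gaze_ratio = np.array([0.0, 0.1, 0.9])  # most fixations fall in region 2

h_plain, _ = lsvm_select(w, feats)                            # score only
h_gaze = gaze_biased_select(w, feats, gaze_ratio, gamma=1.0)  # score and gaze
```

In this toy case the score-only selection picks region 0, while the gaze-biased selection moves to region 2, where the fixations concentrate. Setting `gamma=0` recovers the plain LSVM selection.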
