MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking

In this work, we propose a novel Weakly Supervised Learning (WSL) framework dedicated to learning discriminative part detectors from images annotated with a global label. Our WSL method encompasses three main contributions. Firstly, we introduce a new structured output latent variable model, Minimum mAximum lateNt sTRucturAl SVM (MANTRA), whose prediction relies on a pair of latent variables: h+ (resp. h-) provides positive (resp. negative) evidence for a given output y. Secondly, we instantiate MANTRA for two different visual recognition tasks: multi-class classification and ranking. For ranking, we propose efficient solutions to exactly solve the inference and the loss-augmented inference problems. Finally, extensive experiments highlight the relevance of the proposed method: MANTRA outperforms state-of-the-art results on five different datasets.
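As a rough illustration of how a prediction could rely on the pair (h+, h-), the sketch below scores an output y by adding the highest-scoring latent region (positive evidence) to the lowest-scoring one (negative evidence). The additive max-plus-min form, the per-class linear weights, and the names `mantra_score` and `predict` are assumptions made for this example; the abstract does not spell out the exact scoring function.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mantra_score(w_y: Vector, region_features: List[Vector]) -> Tuple[float, int, int]:
    """Score one output y from its latent regions (hypothetical helper).

    Assumption: the score is the sum of the best region score (h+)
    and the worst region score (h-). Returns (score, h_plus, h_minus),
    where h_plus and h_minus are region indices.
    """
    scores = [dot(w_y, f) for f in region_features]
    indices = range(len(scores))
    h_plus = max(indices, key=scores.__getitem__)   # positive evidence
    h_minus = min(indices, key=scores.__getitem__)  # negative evidence
    return scores[h_plus] + scores[h_minus], h_plus, h_minus


def predict(weights: Dict[str, Vector], region_features: List[Vector]) -> str:
    """Multi-class prediction: pick the label with the highest score."""
    return max(weights, key=lambda y: mantra_score(weights[y], region_features)[0])


# Toy data: three candidate regions described by 2-D features (made up).
regions = [[2.0, 0.5], [0.5, 1.0], [1.0, -3.0]]
weights = {"beach": [1.0, 0.0], "street": [0.0, 1.0]}
label = predict(weights, regions)
```

In this toy setting, a region that strongly contradicts a class lowers that class's score through h-, which a max-only latent model would ignore.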
