Paper information - MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking

In this work, we propose a novel Weakly Supervised Learning (WSL) framework dedicated to learning discriminative part detectors from images annotated with a global label. Our WSL method makes three main contributions. First, we introduce a new structured output latent variable model, Minimum mAximum lateNt sTRucturAl SVM (MANTRA), whose prediction relies on a pair of latent variables: h+ (resp. h-) provides positive (resp. negative) evidence for a given output y. Second, we instantiate MANTRA for two different visual recognition tasks: multi-class classification and ranking. For ranking, we propose efficient solutions that solve the inference and loss-augmented inference problems exactly. Finally, extensive experiments highlight the relevance of the proposed method: MANTRA outperforms state-of-the-art results on five different datasets.
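The role of the latent pair (h+, h-) can be illustrated with a small sketch. The abstract does not state the scoring function, so the code below assumes that the score of an output y is the sum of its maximum and minimum latent-region scores. The names `mantra_score` and `predict` and the `region_scores` matrix are hypothetical and stand in for the learned model's responses.

```python
import numpy as np

def mantra_score(region_scores):
    """Score each class from a (num_classes, num_regions) matrix.

    region_scores[y, h] is a hypothetical model response for class y at
    latent region h. Assumption: the class score is max_h + min_h.
    """
    region_scores = np.asarray(region_scores, dtype=float)
    h_plus = region_scores.argmax(axis=1)   # region giving positive evidence
    h_minus = region_scores.argmin(axis=1)  # region giving negative evidence
    scores = region_scores.max(axis=1) + region_scores.min(axis=1)
    return scores, h_plus, h_minus

def predict(region_scores):
    """Return the best class and its (h+, h-) pair."""
    scores, h_plus, h_minus = mantra_score(region_scores)
    y_hat = int(scores.argmax())
    return y_hat, int(h_plus[y_hat]), int(h_minus[y_hat])

# Two classes, three candidate regions (made-up numbers).
example = [[2.0, 0.5, -3.0],
           [1.5, 1.0, 0.2]]
y_hat, h_pos, h_neg = predict(example)
```

In this toy input, a max-only rule would pick class 0 because it has the single strongest region, while the assumed max + min rule picks class 1 because class 0 also contains a region with strong negative evidence.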
