Tell Me What You See and I Will Show You Where It Is

We tackle the problem of weakly labeled semantic segmentation, where the only source of annotation are image tags encoding which classes are present in the scene. This is an extremely difficult problem as no pixel-wise labelings are available, not even at training time. In this paper, we show that this problem can be formalized as an instance of learning in a latent structured prediction framework, where the graphical model encodes the presence and absence of a class as well as the assignments of semantic labels to superpixels. As a consequence, we are able to leverage standard algorithms with good theoretical properties. We demonstrate the effectiveness of our approach using the challenging SIFT-flow dataset and show average per-class accuracy improvements of 7% over the state-of-the-art.

[1]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Raquel Urtasun,et al.  Latent Structured Active Learning , 2013, NIPS.

[3]  Svetlana Lazebnik,et al.  Scene recognition and weakly supervised object localization with deformable part-based models , 2011, 2011 International Conference on Computer Vision.

[4]  Marie-Pierre Jolly,et al.  Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images , 2001, ICCV.

[5]  Alan L. Yuille,et al.  The Concave-Convex Procedure (CCCP) , 2001, NIPS.

[6]  Joachim M. Buhmann,et al.  Weakly supervised semantic segmentation with a multi-image model , 2011, 2011 International Conference on Computer Vision.

[7]  Vladimir Kolmogorov,et al.  Cosegmentation Revisited: Models and Optimization , 2010, ECCV.

[8]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[9]  Joachim M. Buhmann,et al.  Weakly supervised structured output learning for semantic segmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Sanja Fidler,et al.  Beat the MTurkers: Automatic Image Labeling from Weak 3D Supervision , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[12]  Pushmeet Kohli,et al.  Graph Cut Based Inference with Co-occurrence Statistics , 2010, ECCV.

[13]  Andrew Blake,et al.  Cosegmentation of Image Pairs by Histogram Matching - Incorporating a Global Constraint into MRFs , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[14]  Alan L. Yuille,et al.  The Concave-Convex Procedure , 2003, Neural Computation.

[15]  Bill Triggs,et al.  Region Classification with Markov Field Aspect Models , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Jia Xu,et al.  Analyzing the Subspace Structure of Related Images: Concurrent Segmentation of Image Sets , 2012, ECCV.

[17]  Gert R. G. Lanckriet,et al.  On the Convergence of the Concave-Convex Procedure , 2009, NIPS.

[18]  Marc Pollefeys,et al.  Distributed message passing for large scale graphical models , 2011, CVPR 2011.

[19]  Svetlana Lazebnik,et al.  Superparsing - Scalable Nonparametric Image Parsing with Superpixels , 2010, International Journal of Computer Vision.

[20]  Trevor Darrell,et al.  Hidden-state Conditional Random Fields , 2006 .

[21]  Marie-Pierre Jolly,et al.  Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[22]  Jean Ponce,et al.  Multi-class cosegmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Joachim M. Buhmann,et al.  Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[24]  Marc Pollefeys,et al.  Efficient Structured Prediction with Latent Variables for General Graphical Models , 2012, ICML.

[25]  Antonio Torralba,et al.  Nonparametric Scene Parsing via Label Transfer , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Thorsten Joachims,et al.  Learning structural SVMs with latent variables , 2009, ICML '09.

[27]  Cristian Sminchisescu,et al.  Object Recognition by Sequential Figure-Ground Ranking , 2011, International Journal of Computer Vision.

[28]  Eric P. Xing,et al.  On multiple foreground cosegmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[31]  Trevor Darrell,et al.  Hidden Conditional Random Fields , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Jitendra Malik,et al.  Recognition using regions , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Svetlana Lazebnik,et al.  Finding Things: Image Parsing with Regions and Per-Exemplar Detectors , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Vikas Singh,et al.  Scale invariant cosegmentation for image groups , 2011, CVPR 2011.

[35]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Sanja Fidler,et al.  Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Joost van de Weijer,et al.  Harmony potentials for joint classification and segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.