Learning to segment with image-level annotations

Recently, deep convolutional neural networks (DCNNs) have significantly advanced semantic image segmentation. However, previous approaches to training a segmentation network often rely on a large amount of ground truth with pixel-level annotations, which requires considerable human effort. In this paper, we explore a more challenging problem: learning to segment with only image-level annotations. Specifically, our framework consists of two components. First, reliable hypothesis-based localization maps are generated by combining hypothesis-aware classification with cross-image contextual refinement. Second, the segmentation network is trained in a supervised manner on these generated localization maps. We explore two network training strategies. In the first, a novel multi-label cross-entropy loss trains the network directly on the localization maps of all classes, with each pixel contributing to each class with a different weight. In the second, a rough segmentation mask is inferred from the localization maps, and the network is then optimized with a single-label cross-entropy loss on the produced masks. We evaluate our methods on the PASCAL VOC 2012 segmentation benchmark. Extensive experimental results demonstrate the effectiveness of the proposed methods compared with state-of-the-art methods.

Highlights:
Localization maps are generated using hypothesis-based classification.
A novel multi-label loss is proposed to train the network on the localization maps.
An effective method is proposed to predict a rough mask for each training image.
Our methods achieve new state-of-the-art results on the PASCAL VOC 2012 benchmark.
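The two training strategies summarized in the abstract can be illustrated with a small NumPy sketch. The abstract does not give the exact loss formulas, so everything below is an assumption made for illustration: the per-pixel normalization of localization scores into class weights, the use of a plain argmax to obtain the rough mask, and the function names `multi_label_ce`, `rough_mask`, and `single_label_ce` are hypothetical and not taken from the paper.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def multi_label_ce(logits, loc_maps, eps=1e-12):
    """Strategy 1 (assumed form): weighted cross-entropy over all classes.

    logits, loc_maps: arrays of shape (C, H, W); loc_maps is non-negative.
    Each pixel contributes to every class, weighted by its localization
    score normalized over classes (an assumption, not the paper's formula).
    """
    weights = loc_maps / (loc_maps.sum(axis=0, keepdims=True) + eps)
    log_prob = np.log(softmax(logits, axis=0) + eps)
    return float(-(weights * log_prob).sum(axis=0).mean())


def rough_mask(loc_maps):
    """Strategy 2, step 1 (assumed form): per-pixel argmax over classes."""
    return loc_maps.argmax(axis=0)


def single_label_ce(logits, mask, eps=1e-12):
    """Strategy 2, step 2: standard cross-entropy against the rough mask."""
    prob = softmax(logits, axis=0)
    rows, cols = np.indices(mask.shape)
    picked = prob[mask, rows, cols]  # probability of the masked class per pixel
    return float(-np.log(picked + eps).mean())


# Toy example: 3 classes on a 4x4 image with random scores.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4, 4))
loc_maps = rng.random(size=(3, 4, 4))

loss_soft = multi_label_ce(logits, loc_maps)
mask = rough_mask(loc_maps)
loss_hard = single_label_ce(logits, mask)
```

Under these assumptions, the two losses coincide when the localization maps are one-hot at every pixel, which makes the second strategy a hard-assignment special case of the first.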
