Paper Information - Learning Semantic Segmentation with Weakly-Annotated Videos

Fully convolutional neural networks (FCNNs) trained on a large number of images with strong pixel-level annotations have become the new state of the art for the semantic segmentation task. While there have been recent attempts to learn FCNNs from image-level weak annotations, they need additional constraints, such as the size of an object, to obtain reasonable performance. To address this issue, we present motion-CNN (M-CNN), a novel FCNN framework that incorporates motion cues and is learned from video-level weak annotations. Our training scheme uses motion segments as soft constraints, thereby handling noisy motion information. When trained on weakly-annotated videos, our method outperforms the state-of-the-art EM-Adapt approach on the PASCAL VOC 2012 image segmentation benchmark. We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images. Finally, M-CNN substantially outperforms recent approaches on the related task of video co-localization on the YouTube-Objects dataset. This is an extended version of our ECCV 2016 paper.
