论文信息 - DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision

DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision

We introduce DiscoBox, a novel framework that jointly learns instance segmentation and semantic correspondence using bounding box supervision. Specifically, we propose a self-ensembling framework where instance segmentation and semantic correspondence are jointly guided by a structured teacher in addition to the bounding box supervision. The teacher is a structured energy model incorporating a pairwise potential and a cross-image potential to model the pairwise pixel relationships both within and across the boxes. Minimizing the teacher energy simultaneously yields refined object masks and dense correspondences between intra-class objects, which are taken as pseudo-labels to supervise the task network and provide positive/negative correspondence pairs for dense contrastive learning. We show a symbiotic relationship where the two tasks mutually benefit from each other. Our best model achieves 37.9% AP on COCO instance segmentation, surpassing prior weakly supervised methods and is competitive to supervised methods. We also obtain state of the art weakly supervised results on PASCAL VOC12 and PF-PASCAL with real-time inference.

[1] Paul A. Viola,et al. Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[2] G LoweDavid,et al. Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[3] Andrew Blake,et al. "GrabCut" , 2004, ACM Trans. Graph..

[4] Steven M. Seitz,et al. Photo tourism: exploring photo collections in 3D , 2006, ACM Trans. Graph..

[5] Tom Drummond,et al. Machine Learning for High-Speed Corner Detection , 2006, ECCV.

[6] Philip A. Knight,et al. The Sinkhorn-Knopp Algorithm: Convergence and Applications , 2008, SIAM J. Matrix Anal. Appl..

[7] Christopher Hunt,et al. Notes on the OpenSURF Library , 2009 .

[8] Luc Van Gool,et al. The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[9] Vincent Lepetit,et al. DAISY: An Efficient Dense Descriptor Applied to Wide-Baseline Stereo , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10] Antonio Torralba,et al. SIFT Flow: Dense Correspondence across Scenes and Its Applications , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11] Pietro Perona,et al. The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[12] Vladlen Koltun,et al. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[13] Siddhartha S. Srinivasa,et al. The MOPED framework: Object recognition and pose estimation for manipulation , 2011, Int. J. Robotics Res..

[14] Deva Ramanan,et al. Analyzing 3D Objects in Cluttered Images , 2012, NIPS.

[15] Deva Ramanan,et al. Face detection, pose estimation, and landmark localization in the wild , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16] Trevor Darrell,et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17] Jitendra Malik,et al. Simultaneous Detection and Segmentation , 2014, ECCV.

[18] Silvio Savarese,et al. Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[19] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[20] Jonathan T. Barron,et al. Multiscale Combinatorial Grouping , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21] Yong Jae Lee,et al. FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23] Jitendra Malik,et al. Viewpoints and keypoints , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Rahul Sukthankar,et al. MatchNet: Unifying feature and metric learning for patch-based matching , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Yi Yang,et al. DenseBox: Unifying Landmark Localization with End to End Object Detection , 2015, ArXiv.

[26] George Papandreou,et al. Weakly-and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[27] Jian Sun,et al. BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28] Vincent Lepetit,et al. TILDE: A Temporally Invariant Learned DEtector , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Jian Sun,et al. Instance-Aware Semantic Segmentation via Multi-task Network Cascades , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Jian Sun,et al. ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Vincent Lepetit,et al. LIFT: Learned Invariant Feature Transform , 2016, ECCV.

[32] Wei Liu,et al. SSD: Single Shot MultiBox Detector , 2015, ECCV.

[33] Leonidas J. Guibas,et al. ObjectNet3D: A Large Scale Database for 3D Object Recognition , 2016, ECCV.

[34] Andrea Vedaldi,et al. Weakly Supervised Deep Detection Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Silvio Savarese,et al. Universal Correspondence Network , 2016, NIPS.

[36] Jean Ponce,et al. Proposal Flow , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37] Andrea Vedaldi,et al. I Have Seen Enough: Transferring Parts Across Categories , 2016, BMVC.

[38] Yoichi Sato,et al. Joint Recovery of Dense Correspondence and Cosegmentation in Two Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Fei-Fei Li,et al. What's the Point: Semantic Segmentation with Point Supervision , 2015, ECCV.

[40] Ali Farhadi,et al. You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Jean Ponce,et al. SCNet: Learning Semantic Correspondence , 2017, ICCV.

[42] Yi Li,et al. Fully Convolutional Instance-Aware Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43] Harri Valpola,et al. Weight-averaged consistency targets improve semi-supervised deep learning results , 2017, ArXiv.

[44] Sabine Süsstrunk,et al. Webly Supervised Semantic Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Xiaowei Zhou,et al. 6-DoF object pose from semantic keypoints , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[46] Paul Vernaza,et al. Learning Random-Walk Label Propagation for Weakly-Supervised Semantic Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Bernt Schiele,et al. Simple Does It: Weakly Supervised Instance and Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Georgios Tzimiropoulos,et al. How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks) , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49] Min Bai,et al. Deep Watershed Transform for Instance Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Josef Sivic,et al. Convolutional Neural Network Architecture for Geometric Matching , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51] Kaiming He,et al. Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Matthieu Cord,et al. WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53] James M. Rehg,et al. 3D-RCNN: Instance-Level 3D Object Reconstruction via Render-and-Compare , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54] Jitendra Malik,et al. Learning Category-Specific Mesh Reconstruction from Image Collections , 2018, ECCV.

[55] Ismail Ben Ayed,et al. On Regularized Losses for Weakly-supervised CNN Segmentation , 2018, ECCV.

[56] Dieter Fox,et al. Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects , 2018, CoRL.

[57] Ralph R. Martin,et al. Associating Inter-image Salient Instances for Weakly Supervised Semantic Segmentation , 2018, ECCV.

[58] Josef Sivic,et al. End-to-End Weakly-Supervised Semantic Alignment , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59] Pascal Fua,et al. LF-Net: Learning Local Features from Images , 2018, NeurIPS.

[60] Yuri Boykov,et al. Normalized Cut Loss for Weakly-Supervised CNN Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61] Qiang Qiu,et al. Weakly Supervised Instance Segmentation Using Class Peak Response , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62] Tomás Pajdla,et al. Neighbourhood Consensus Networks , 2018, NeurIPS.

[63] Cordelia Schmid,et al. Proposal Flow: Semantic Correspondences from Object Proposals , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64] Suha Kwak,et al. Learning Pixel-Level Semantic Affinity with Image-Level Supervision for Weakly Supervised Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[65] Stephen Lin,et al. Recurrent Transformer Networks for Semantic Correspondence , 2018, NeurIPS.

[66] Xuming He,et al. Dynamic Context Correspondence Network for Semantic Alignment , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[67] Shubham Tulsiani,et al. Canonical Surface Mapping via Geometric Cycle Consistency , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[68] Yan Huang,et al. Box-Driven Class-Wise Region Masking and Filling Rate Guided Loss for Weakly Supervised Semantic Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[69] Andrea Vedaldi,et al. C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[70] Yong Jae Lee,et al. YOLACT: Real-Time Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[71] Xingyi Zhou,et al. Objects as Points , 2019, ArXiv.

[72] Jean Ponce,et al. SFNet: Learning Object-Aware Semantic Correspondence , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[73] Suha Kwak,et al. Weakly Supervised Learning of Instance Segmentation With Inter-Pixel Relations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[74] Jitendra Malik,et al. Mesh R-CNN , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[75] Jean Ponce,et al. Hyperpixel Flow: Semantic Correspondence With Multi-Layer Neural Features , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[76] Show, Match and Segment: Joint Learning of Semantic Matching and Object Co-segmentation , 2019, ArXiv.

[77] Jian Sun,et al. Objects365: A Large-Scale, High-Quality Dataset for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[78] Jean Ponce,et al. SPair-71k: A Large-scale Benchmark for Semantic Correspondence , 2019, ArXiv.

[79] Yunchao Wei,et al. Weakly Supervised Scene Parsing with Point-based Distance Metric Learning , 2018, AAAI.

[80] Hao Chen,et al. FCOS: Fully Convolutional One-Stage Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[81] Pavlo Molchanov,et al. SCOPS: Self-Supervised Co-Part Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[82] Yung-Yu Chuang,et al. Weakly Supervised Instance Segmentation using the Bounding Box Tightness Prior , 2019, NeurIPS.

[83] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.

[84] Makoto Yamada,et al. Semantic Correspondence as an Optimal Transport Problem , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[85] Quoc V. Le,et al. Self-Training With Noisy Student Improves ImageNet Classification , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[86] David Berthelot,et al. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence , 2020, NeurIPS.

[87] Luc Van Gool,et al. Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation , 2020, ECCV.

[88] Quoc V. Le,et al. EfficientDet: Scalable and Efficient Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[89] Ping Luo,et al. PolarMask: Single Shot Instance Segmentation With Polar Representation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[90] SuperGlue: Learning Feature Matching With Graph Neural Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[91] Jan Kautz,et al. Self-supervised Single-view 3D Reconstruction via Semantic Consistency , 2020, ECCV.

[92] Ambrish Tyagi,et al. Box2Seg: Attention Weighted Loss and Discriminative Feature Learning for Weakly Supervised Segmentation , 2020, ECCV.

[93] Hao Chen,et al. Conditional Convolutions for Instance Segmentation , 2020, ECCV.

[94] Tao Kong,et al. SOLOv2: Dynamic and Fast Instance Segmentation , 2020, NeurIPS.

[95] Ross B. Girshick,et al. Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[96] Seungryong Kim,et al. Guided Semantic Flow , 2020, ECCV.

[97] Jean Ponce,et al. Learning to Compose Hypercolumns for Visual Correspondence , 2020, ECCV.

[98] C. V. Jawahar,et al. Weakly Supervised Instance Segmentation by Learning Annotation Consistent Instances , 2020, ECCV.

[99] Zhi Tian,et al. BoxInst: High-Performance Instance Segmentation with Box Annotations , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[100] Yong Jae Lee,et al. YOLACT++ Better Real-Time Instance Segmentation , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.