DiscoBox: Weakly Supervised Instance Segmentation and Semantic Correspondence from Box Supervision

We introduce DiscoBox, a novel framework that jointly learns instance segmentation and semantic correspondence using bounding box supervision. Specifically, we propose a self-ensembling framework where instance segmentation and semantic correspondence are jointly guided by a structured teacher in addition to the bounding box supervision. The teacher is a structured energy model incorporating a pairwise potential and a cross-image potential to model the pairwise pixel relationships both within and across the boxes. Minimizing the teacher energy simultaneously yields refined object masks and dense correspondences between intra-class objects, which are taken as pseudo-labels to supervise the task network and provide positive/negative correspondence pairs for dense contrastive learning. We show a symbiotic relationship where the two tasks mutually benefit from each other. Our best model achieves 37.9% AP on COCO instance segmentation, surpassing prior weakly supervised methods and is competitive to supervised methods. We also obtain state of the art weakly supervised results on PASCAL VOC12 and PF-PASCAL with real-time inference.

[1]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[2]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[3]  Andrew Blake,et al.  "GrabCut" , 2004, ACM Trans. Graph..

[4]  Steven M. Seitz,et al.  Photo tourism: exploring photo collections in 3D , 2006, ACM Trans. Graph..

[5]  Tom Drummond,et al.  Machine Learning for High-Speed Corner Detection , 2006, ECCV.

[6]  Philip A. Knight,et al.  The Sinkhorn-Knopp Algorithm: Convergence and Applications , 2008, SIAM J. Matrix Anal. Appl..

[7]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .

[8]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[9]  Vincent Lepetit,et al.  DAISY: An Efficient Dense Descriptor Applied to Wide-Baseline Stereo , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Antonio Torralba,et al.  SIFT Flow: Dense Correspondence across Scenes and Its Applications , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Pietro Perona,et al.  The Caltech-UCSD Birds-200-2011 Dataset , 2011 .

[12]  Vladlen Koltun,et al.  Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials , 2011, NIPS.

[13]  Siddhartha S. Srinivasa,et al.  The MOPED framework: Object recognition and pose estimation for manipulation , 2011, Int. J. Robotics Res..

[14]  Deva Ramanan,et al.  Analyzing 3D Objects in Cluttered Images , 2012, NIPS.

[15]  Deva Ramanan,et al.  Face detection, pose estimation, and landmark localization in the wild , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Jitendra Malik,et al.  Simultaneous Detection and Segmentation , 2014, ECCV.

[18]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[19]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[20]  Jonathan T. Barron,et al.  Multiscale Combinatorial Grouping , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Yong Jae Lee,et al.  FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Jitendra Malik,et al.  Viewpoints and keypoints , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Rahul Sukthankar,et al.  MatchNet: Unifying feature and metric learning for patch-based matching , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Yi Yang,et al.  DenseBox: Unifying Landmark Localization with End to End Object Detection , 2015, ArXiv.

[26]  George Papandreou,et al.  Weakly-and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[27]  Jian Sun,et al.  BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Vincent Lepetit,et al.  TILDE: A Temporally Invariant Learned DEtector , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jian Sun,et al.  Instance-Aware Semantic Segmentation via Multi-task Network Cascades , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Jian Sun,et al.  ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Vincent Lepetit,et al.  LIFT: Learned Invariant Feature Transform , 2016, ECCV.

[32]  Wei Liu,et al.  SSD: Single Shot MultiBox Detector , 2015, ECCV.

[33]  Leonidas J. Guibas,et al.  ObjectNet3D: A Large Scale Database for 3D Object Recognition , 2016, ECCV.

[34]  Andrea Vedaldi,et al.  Weakly Supervised Deep Detection Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Silvio Savarese,et al.  Universal Correspondence Network , 2016, NIPS.

[36]  Jean Ponce,et al.  Proposal Flow , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Andrea Vedaldi,et al.  I Have Seen Enough: Transferring Parts Across Categories , 2016, BMVC.

[38]  Yoichi Sato,et al.  Joint Recovery of Dense Correspondence and Cosegmentation in Two Images , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Fei-Fei Li,et al.  What's the Point: Semantic Segmentation with Point Supervision , 2015, ECCV.

[40]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Jean Ponce,et al.  SCNet: Learning Semantic Correspondence , 2017, ICCV.

[42]  Yi Li,et al.  Fully Convolutional Instance-Aware Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Harri Valpola,et al.  Weight-averaged consistency targets improve semi-supervised deep learning results , 2017, ArXiv.

[44]  Sabine Süsstrunk,et al.  Webly Supervised Semantic Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Xiaowei Zhou,et al.  6-DoF object pose from semantic keypoints , 2017, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[46]  Paul Vernaza,et al.  Learning Random-Walk Label Propagation for Weakly-Supervised Semantic Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[47]  Bernt Schiele,et al.  Simple Does It: Weakly Supervised Instance and Semantic Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Georgios Tzimiropoulos,et al.  How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks) , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  Min Bai,et al.  Deep Watershed Transform for Instance Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Josef Sivic,et al.  Convolutional Neural Network Architecture for Geometric Matching , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[51]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Matthieu Cord,et al.  WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  James M. Rehg,et al.  3D-RCNN: Instance-Level 3D Object Reconstruction via Render-and-Compare , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54]  Jitendra Malik,et al.  Learning Category-Specific Mesh Reconstruction from Image Collections , 2018, ECCV.

[55]  Ismail Ben Ayed,et al.  On Regularized Losses for Weakly-supervised CNN Segmentation , 2018, ECCV.

[56]  Dieter Fox,et al.  Deep Object Pose Estimation for Semantic Robotic Grasping of Household Objects , 2018, CoRL.

[57]  Ralph R. Martin,et al.  Associating Inter-image Salient Instances for Weakly Supervised Semantic Segmentation , 2018, ECCV.

[58]  Josef Sivic,et al.  End-to-End Weakly-Supervised Semantic Alignment , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59]  Pascal Fua,et al.  LF-Net: Learning Local Features from Images , 2018, NeurIPS.

[60]  Yuri Boykov,et al.  Normalized Cut Loss for Weakly-Supervised CNN Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61]  Qiang Qiu,et al.  Weakly Supervised Instance Segmentation Using Class Peak Response , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62]  Tomás Pajdla,et al.  Neighbourhood Consensus Networks , 2018, NeurIPS.

[63]  Cordelia Schmid,et al.  Proposal Flow: Semantic Correspondences from Object Proposals , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64]  Suha Kwak,et al.  Learning Pixel-Level Semantic Affinity with Image-Level Supervision for Weakly Supervised Semantic Segmentation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[65]  Stephen Lin,et al.  Recurrent Transformer Networks for Semantic Correspondence , 2018, NeurIPS.

[66]  Xuming He,et al.  Dynamic Context Correspondence Network for Semantic Alignment , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[67]  Shubham Tulsiani,et al.  Canonical Surface Mapping via Geometric Cycle Consistency , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[68]  Yan Huang,et al.  Box-Driven Class-Wise Region Masking and Filling Rate Guided Loss for Weakly Supervised Semantic Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Andrea Vedaldi,et al.  C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[70]  Yong Jae Lee,et al.  YOLACT: Real-Time Instance Segmentation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[71]  Xingyi Zhou,et al.  Objects as Points , 2019, ArXiv.

[72]  Jean Ponce,et al.  SFNet: Learning Object-Aware Semantic Correspondence , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[73]  Suha Kwak,et al.  Weakly Supervised Learning of Instance Segmentation With Inter-Pixel Relations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Jitendra Malik,et al.  Mesh R-CNN , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[75]  Jean Ponce,et al.  Hyperpixel Flow: Semantic Correspondence With Multi-Layer Neural Features , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[76]  Show, Match and Segment: Joint Learning of Semantic Matching and Object Co-segmentation , 2019, ArXiv.

[77]  Jian Sun,et al.  Objects365: A Large-Scale, High-Quality Dataset for Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[78]  Jean Ponce,et al.  SPair-71k: A Large-scale Benchmark for Semantic Correspondence , 2019, ArXiv.

[79]  Yunchao Wei,et al.  Weakly Supervised Scene Parsing with Point-based Distance Metric Learning , 2018, AAAI.

[80]  Hao Chen,et al.  FCOS: Fully Convolutional One-Stage Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[81]  Pavlo Molchanov,et al.  SCOPS: Self-Supervised Co-Part Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[82]  Yung-Yu Chuang,et al.  Weakly Supervised Instance Segmentation using the Bounding Box Tightness Prior , 2019, NeurIPS.

[83]  Ross B. Girshick,et al.  Mask R-CNN , 2017, 1703.06870.

[84]  Makoto Yamada,et al.  Semantic Correspondence as an Optimal Transport Problem , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[85]  Quoc V. Le,et al.  Self-Training With Noisy Student Improves ImageNet Classification , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[86]  David Berthelot,et al.  FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence , 2020, NeurIPS.

[87]  Luc Van Gool,et al.  Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation , 2020, ECCV.

[88]  Quoc V. Le,et al.  EfficientDet: Scalable and Efficient Object Detection , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[89]  Ping Luo,et al.  PolarMask: Single Shot Instance Segmentation With Polar Representation , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[90]  SuperGlue: Learning Feature Matching With Graph Neural Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[91]  Jan Kautz,et al.  Self-supervised Single-view 3D Reconstruction via Semantic Consistency , 2020, ECCV.

[92]  Ambrish Tyagi,et al.  Box2Seg: Attention Weighted Loss and Discriminative Feature Learning for Weakly Supervised Segmentation , 2020, ECCV.

[93]  Hao Chen,et al.  Conditional Convolutions for Instance Segmentation , 2020, ECCV.

[94]  Tao Kong,et al.  SOLOv2: Dynamic and Fast Instance Segmentation , 2020, NeurIPS.

[95]  Ross B. Girshick,et al.  Momentum Contrast for Unsupervised Visual Representation Learning , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[96]  Seungryong Kim,et al.  Guided Semantic Flow , 2020, ECCV.

[97]  Jean Ponce,et al.  Learning to Compose Hypercolumns for Visual Correspondence , 2020, ECCV.

[98]  C. V. Jawahar,et al.  Weakly Supervised Instance Segmentation by Learning Annotation Consistent Instances , 2020, ECCV.

[99]  Zhi Tian,et al.  BoxInst: High-Performance Instance Segmentation with Box Annotations , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[100]  Yong Jae Lee,et al.  YOLACT++ Better Real-Time Instance Segmentation , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.