Scaling up instance annotation via label propagation

Manually annotating object segmentation masks is very time-consuming. While interactive segmentation methods offer a more efficient alternative, they become unaffordable at a large scale because the cost grows linearly with the number of annotated masks. In this paper, we propose a highly efficient annotation scheme for building large datasets with object segmentation masks. At a large scale, images contain many object instances with similar appearance. We exploit these similarities by using hierarchical clustering on mask predictions made by a segmentation model. We propose a scheme that efficiently searches through the hierarchy of clusters and selects which clusters to annotate. Humans manually verify only a few masks per cluster, and the labels are propagated to the whole cluster. Through a large-scale experiment to populate 1M unlabeled images with object segmentation masks for 80 object classes, we show that (1) we obtain 1M object segmentation masks with an total annotation time of only 290 hours; (2) we reduce annotation time by 76× compared to manual annotation; (3) the segmentation quality of our masks is on par with those from manually annotated datasets. Code, data, and models are available online1.

[1]  Trevor Darrell,et al.  Variational Adversarial Active Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[2]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[3]  Kristen Grauman,et al.  Active Image Segmentation Propagation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Bolei Zhou,et al.  Scene Parsing through ADE20K Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Frank Keller,et al.  Training Object Class Detectors with Click Supervision , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Fei-Fei Li,et al.  What's the Point: Semantic Segmentation with Point Supervision , 2015, ECCV.

[7]  Jian Sun,et al.  ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Leonidas J. Guibas,et al.  A scalable active framework for region annotation in 3D shape collections , 2016, ACM Trans. Graph..

[9]  Kaiming He,et al.  Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[11]  Ross B. Girshick,et al.  LVIS: A Dataset for Large Vocabulary Instance Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[13]  Sanja Fidler,et al.  Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++ , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[14]  Chen Sun,et al.  Revisiting Unreasonable Effectiveness of Data in Deep Learning Era , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Yongchao Gong,et al.  Mask Scoring R-CNN , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Ali Farhadi,et al.  The benefits and challenges of collecting richer object annotations , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[18]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[19]  Seungbum Hong,et al.  Semi-Supervised Object Detection With Sparsely Annotated Dataset , 2020, 2021 IEEE International Conference on Image Processing (ICIP).

[20]  Xiangyu Zhang,et al.  Co-mining: Self-Supervised Learning for Sparsely Annotated Object Detection , 2020, AAAI.

[21]  Zhuwen Li,et al.  Interactive Image Segmentation with Latent Diversity , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  In So Kweon,et al.  Learning Loss for Active Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Deva Ramanan,et al.  Efficiently Scaling up Crowdsourced Video Annotation , 2012, International Journal of Computer Vision.

[24]  Peter Kontschieder,et al.  The Mapillary Vistas Dataset for Semantic Understanding of Street Scenes , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[25]  Luc Van Gool,et al.  Deep Extreme Cut: From Extreme Points to Object Segmentation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Sanja Fidler,et al.  Fast Interactive Object Annotation With Curve-GCN , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Takuya Akiba,et al.  Sampling Techniques for Large-Scale Object Detection From Sparsely Annotated Objects , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Toby Sharp,et al.  Image segmentation with a bounding box prior , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[29]  Matthieu Guillaumin,et al.  Segmentation Propagation in ImageNet , 2012, ECCV.

[30]  Bolei Zhou,et al.  Places: A 10 Million Image Database for Scene Recognition , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Adriana Kovashka,et al.  Actively selecting annotations among objects and attributes , 2011, 2011 International Conference on Computer Vision.

[32]  Andrew Blake,et al.  "GrabCut" , 2004, ACM Trans. Graph..

[33]  Kristen Grauman,et al.  Multi-Level Active Prediction of Useful Image Annotations for Recognition , 2008, NIPS.

[34]  Bolei Zhou,et al.  Learning Deep Features for Scene Recognition using Places Database , 2014, NIPS.

[35]  Frank Keller,et al.  Extreme Clicking for Efficient Object Annotation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Vladimir Kolmogorov,et al.  An experimental comparison of min-cut/max- flow algorithms for energy minimization in vision , 2001, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Antonio Torralba,et al.  LabelMe: A Database and Web-Based Tool for Image Annotation , 2008, International Journal of Computer Vision.

[38]  Carsten Rother,et al.  Panoptic Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  William T. Freeman,et al.  Annotation Propagation in Large Image Databases via Dense Image Correspondence , 2012, ECCV.

[40]  Trevor Darrell,et al.  Active Learning with Gaussian Processes for Object Categorization , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[41]  Charless C. Fowlkes,et al.  Do We Need More Training Data? , 2015, International Journal of Computer Vision.

[42]  Trevor Darrell,et al.  BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling , 2018, ArXiv.

[43]  Rodrigo Benenson,et al.  Large-Scale Interactive Object Segmentation With Human Annotators , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Eirikur Agustsson,et al.  Interactive Full Image Segmentation by Considering All Regions Jointly , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Yinda Zhang,et al.  LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop , 2015, ArXiv.

[46]  Marie-Pierre Jolly,et al.  Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images , 2001, ICCV.

[47]  Mei Han,et al.  A hierarchical conditional random field model for labeling and segmenting images of street scenes , 2011, CVPR 2011.

[48]  Matthieu Guillaumin,et al.  ImageNet Auto-Annotation with Segmentation Propagation , 2014, International Journal of Computer Vision.

[49]  Yuandong Tian,et al.  A Face Annotation Framework with Partial Clustering and Interactive Labeling , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[50]  Abhinav Gupta,et al.  Beyond active noun tagging: Modeling contextual interactions for multi-class active learning , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[51]  David A. Forsyth,et al.  Utility data annotation with Amazon Mechanical Turk , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[52]  Nikolaos Papanikolopoulos,et al.  Multi-class active learning for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Jan Kautz,et al.  Hierarchical Subquery Evaluation for Active Learning on a Graph , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[54]  Bruce A. Draper,et al.  Efficient label collection for unlabeled image datasets , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Kristen Grauman,et al.  Click Carving: Segmenting Objects in Video with Point Clicks , 2016, HCOMP.

[56]  Ning Xu,et al.  Deep Interactive Object Selection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Kaiming He,et al.  PointRend: Image Segmentation As Rendering , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Luc Van Gool,et al.  Interactive object detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  Sanja Fidler,et al.  Annotating Object Instances with a Polygon-RNN , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Zhuowen Tu,et al.  MILCut: A Sweeping Line Multiple Instance Learning Paradigm for Interactive Image Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[61]  Sebastian Ramos,et al.  The Cityscapes Dataset for Semantic Urban Scene Understanding , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Jian Sun,et al.  BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[63]  Kristen Grauman,et al.  Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds , 2011, CVPR 2011.

[64]  Frank Keller,et al.  We Don’t Need No Bounding-Boxes: Training Object Class Detectors Using Only Human Verification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Alexei A. Efros,et al.  Few-Shot Segmentation Propagation with Guided Networks , 2018, ArXiv.

[66]  Sudheendra Vijayanarasimhan,et al.  What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[67]  Alexei A. Efros,et al.  Unbiased look at dataset bias , 2011, CVPR 2011.