Extreme Clicking for Efficient Object Annotation

Manually annotating object bounding boxes is central to building computer vision datasets, and it is very time consuming (annotating ILSVRC [53] took 35s for one high-quality box [62]). It involves clicking on imaginary corners of a tight box around the object. This is difficult because these corners often lie outside the actual object, and several adjustments are required to obtain a tight box. We propose extreme clicking instead: we ask the annotator to click on four physical points on the object: the top-most, bottom-most, left-most and right-most points. This task is more natural and these points are easy to find. We crowd-source extreme point annotations for PASCAL VOC 2007 and 2012 and show that (1) annotation time is only 7s per box, 5× faster than the traditional way of drawing boxes [62]; (2) the quality of the boxes is as good as the original ground-truth drawn the traditional way; (3) detectors trained on our annotations are as accurate as those trained on the original ground-truth. Moreover, our extreme clicking strategy yields not only box coordinates but also four accurate boundary points. We show (4) how to incorporate them into GrabCut to obtain more accurate segmentations than those delivered when initializing it from bounding boxes; (5) semantic segmentation models trained on these segmentations outperform those trained on segmentations derived from bounding boxes.
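To illustrate how four extreme clicks determine a box, the following is a minimal sketch and not the authors' implementation. It assumes clicks are given as (x, y) pixel coordinates with y growing downward; the names `Box`, `box_from_extreme_points` and `iou` are hypothetical. Intersection-over-union is included only as one common way to compare a derived box against a reference box; the abstract does not state which quality measure is used.

```python
from typing import NamedTuple, Sequence, Tuple

# A click is an (x, y) pixel coordinate; y is assumed to grow downward.
Point = Tuple[float, float]


class Box(NamedTuple):
    """Axis-aligned box (hypothetical helper type for this sketch)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def box_from_extreme_points(points: Sequence[Point]) -> Box:
    """Derive the tight box implied by four extreme clicks.

    The order of the clicks does not matter: the left-most click fixes
    x_min, the right-most x_max, the top-most y_min, the bottom-most y_max.
    """
    if len(points) != 4:
        raise ValueError("expected exactly four extreme points")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return Box(min(xs), min(ys), max(xs), max(ys))


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes (assumed quality measure)."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, clicks at (50, 10), (55, 90), (20, 40) and (80, 60) for the top, bottom, left and right points give the box with corners (20, 10) and (80, 90). Unlike corner clicks, the four input points also remain available afterwards as boundary points of the object.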

[1]  Ali Farhadi,et al.  The benefits and challenges of collecting richer object annotations , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[2]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[3]  Andrew Zisserman,et al.  Progressive search space reduction for human pose estimation , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Krista A. Ehinger,et al.  Modelling search for people in 900 scenes: A combined source model of eye guidance , 2009 .

[5]  Mark Everingham,et al.  Learning effective human pose estimation from inaccurate annotation , 2011, CVPR 2011.

[6]  Cordelia Schmid,et al.  Finding Actors and Actions in Movies , 2013, 2013 IEEE International Conference on Computer Vision.

[7]  Adriana Kovashka,et al.  Discovering Attribute Shades of Meaning with the Crowd , 2014, International Journal of Computer Vision.

[8]  Marie-Pierre Jolly,et al.  Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images , 2001, ICCV.

[9]  Vladimir Kolmogorov,et al.  An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision , 2001, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Tao Zhang,et al.  Interactive graph cut based segmentation with shape priors , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11]  Cristian Sminchisescu,et al.  Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths , 2013, NIPS.

[12]  Kristen Grauman,et al.  Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation , 2013, 2013 IEEE International Conference on Computer Vision.

[13]  Guillermo Sapiro,et al.  Geodesic Matting: A Framework for Fast Interactive Image and Video Segmentation and Matting , 2009, International Journal of Computer Vision.

[14]  Luc Van Gool,et al.  The Pascal Visual Object Classes Challenge: A Retrospective , 2014, International Journal of Computer Vision.

[15]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[16]  Deva Ramanan,et al.  Efficiently Scaling up Crowdsourced Video Annotation , 2012, International Journal of Computer Vision.

[17]  Frank Keller,et al.  Training Object Class Detectors from Eye Tracking Data , 2014, ECCV.

[18]  David A. Forsyth,et al.  Utility data annotation with Amazon Mechanical Turk , 2008, 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[19]  Tao Xiang,et al.  Weakly supervised object detector learning with model drift detection , 2011, 2011 International Conference on Computer Vision.

[20]  Deva Ramanan,et al.  Efficiently Scaling up Crowdsourced Video Annotation , 2013 .

[21]  Chong Wang,et al.  Large-Scale Weakly Supervised Object Localization via Latent Category Learning , 2015, IEEE Transactions on Image Processing.

[22]  Vladimir Kolmogorov,et al.  Graph cut based image segmentation with connectivity priors , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Thomas Deselaers,et al.  Weakly Supervised Localization and Learning with Generic Knowledge , 2012, International Journal of Computer Vision.

[24]  G. D. Logan Task Switching , 2022 .

[25]  Zhuowen Tu,et al.  MILCut: A Sweeping Line Multiple Instance Learning Paradigm for Interactive Image Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Charless C. Fowlkes,et al.  Contour Detection and Hierarchical Image Segmentation , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Andrew Blake,et al.  "GrabCut" , 2004, ACM Trans. Graph..

[28]  Michael S. Bernstein,et al.  A Glimpse Far into the Future: Understanding Long-term Crowd Worker Accuracy , 2016, ArXiv.

[29]  Pietro Perona,et al.  The Multidimensional Wisdom of Crowds , 2010, NIPS.

[30]  Frank Keller,et al.  We Don’t Need No Bounding-Boxes: Training Object Class Detectors Using Only Human Verification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Cordelia Schmid,et al.  Analysing Domain Shift Factors between Videos and Images for Object Detection , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[33]  Jonathan Krause,et al.  3D Object Representations for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[34]  Fei-Fei Li,et al.  Object-Centric Spatial Pooling for Image Classification , 2012, ECCV.

[35]  Antonio Torralba,et al.  LabelMe: A Database and Web-Based Tool for Image Annotation , 2008, International Journal of Computer Vision.

[36]  Jean Ponce,et al.  Segmentation by transduction , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Kristen Grauman,et al.  Click Carving: Segmenting Objects in Video with Point Clicks , 2016, HCOMP.

[38]  George Papandreou,et al.  Weakly-and Semi-Supervised Learning of a Deep Convolutional Network for Semantic Image Segmentation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39]  Andrew Blake,et al.  Geodesic star convexity for interactive image segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[40]  C. Lawrence Zitnick,et al.  Structured Forests for Fast Edge Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[41]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[42]  Antonio Torralba,et al.  Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. , 2006, Psychological review.

[43]  Patrick Pérez,et al.  Interactive Image Segmentation Using an Adaptive GMMRF Model , 2004, ECCV.

[44]  Vittorio Ferrari,et al.  Figure-ground segmentation by transferring window masks , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Yong Jae Lee,et al.  Track and Transfer: Watching Videos to Simulate Strong Human Supervision for Weakly-Supervised Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Fei-Fei Li,et al.  What's the Point: Semantic Segmentation with Point Supervision , 2015, ECCV.

[47]  S. Kosslyn,et al.  Topographical representations of mental images in primary visual cortex , 1995, Nature.

[48]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[49]  Scott Cohen,et al.  Geodesic graph cut for interactive image segmentation , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[50]  Fei-Fei Li,et al.  Best of both worlds: Human-machine collaboration for object annotation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Cordelia Schmid,et al.  Learning object class detectors from weakly annotated video , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[52]  Subhransu Maji,et al.  Semantic contours from inverse detectors , 2011, 2011 International Conference on Computer Vision.

[53]  Bo Han,et al.  TouchCut: Fast image and video segmentation using single-touch interaction , 2014, Comput. Vis. Image Underst..

[54]  Cordelia Schmid,et al.  Weakly Supervised Object Localization with Multi-Fold Multiple Instance Learning , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[55]  R. Shepard,et al.  Mental Rotation of Three-Dimensional Objects , 1971, Science.

[56]  Hao Su,et al.  Crowdsourcing Annotations for Visual Object Detection , 2012, HCOMP@AAAI.

[57]  Andrea Vedaldi,et al.  Weakly Supervised Deep Detection Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[59]  David A. Forsyth,et al.  Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary , 2002, ECCV.

[60]  Michael S. Bernstein,et al.  A Glimpse Far into the Future: Understanding Long-term Crowd Worker Quality , 2016, CSCW.

[61]  D. Meyer,et al.  Executive control of cognitive processes in task switching. , 2001, Journal of experimental psychology. Human perception and performance.

[62]  Michael F. Cohen,et al.  An iterative optimization approach for unified image segmentation and matting , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[63]  Bodo Rosenhahn,et al.  Expanding object detector's Horizon: Incremental learning framework for object detection in videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Marie-Pierre Jolly,et al.  Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[65]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[66]  Olga Veksler,et al.  Star Shape Prior for Graph-Cut Image Segmentation , 2008, ECCV.

[67]  Wenxian Yang,et al.  User-Friendly Interactive Image Segmentation Through Unified Combinatorial User Inputs , 2010, IEEE Transactions on Image Processing.

[68]  Leo Grady,et al.  Random Walks for Image Segmentation , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[69]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[70]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[71]  Toby Sharp,et al.  Image segmentation with a bounding box prior , 2009, 2009 IEEE 12th International Conference on Computer Vision.