Learning Intelligent Dialogs for Bounding Box Annotation

We introduce Intelligent Annotation Dialogs for bounding box annotation. We train an agent to automatically choose a sequence of actions for a human annotator to produce a bounding box in a minimal amount of time. Specifically, we consider two actions: box verification [34], where the annotator verifies a box generated by an object detector, and manual box drawing. We explore two kinds of agents, one based on predicting the probability that a box will be positively verified, and the other based on reinforcement learning. We demonstrate that (1) our agents are able to learn efficient annotation strategies in several scenarios, automatically adapting to the image difficulty, the desired quality of the boxes, and the detector strength; (2) in all scenarios the resulting annotation dialogs speed up annotation compared to manual box drawing alone and box verification alone, while also outperforming any fixed combination of verification and drawing in most scenarios; (3) in a realistic scenario where the detector is iteratively re-trained, our agents evolve a series of strategies that reflect the shifting trade-off between verification and drawing as the detector grows stronger.

[1]  Frank Keller,et al.  We Don’t Need No Bounding-Boxes: Training Object Class Detectors Using Only Human Verification , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Kristen Grauman,et al.  Multi-Level Active Prediction of Useful Image Annotations for Recognition , 2008, NIPS.

[3]  Sanja Fidler,et al.  Annotating Object Instances with a Polygon-RNN , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Pietro Perona,et al.  Active Annotation Translation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Frank Keller,et al.  Extreme Clicking for Efficient Object Annotation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Marie-Pierre Jolly,et al.  Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images , 2001, Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001.

[7]  Hsuan-Tien Lin,et al.  Active Learning by Learning , 2015, AAAI.

[8]  Deva Ramanan,et al.  Efficiently Scaling up Crowdsourced Video Annotation , 2012, International Journal of Computer Vision.

[9]  Tao Xiang,et al.  Weakly supervised object detector learning with model drift detection , 2011, 2011 International Conference on Computer Vision.

[10]  Kristen Grauman,et al.  What's it going to cost you?: Predicting effort vs. informativeness for multi-label image annotations , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Frank Keller,et al.  Training Object Class Detectors with Click Supervision , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Sergey Levine,et al.  End-to-End Training of Deep Visuomotor Policies , 2015, J. Mach. Learn. Res..

[13]  Cordelia Schmid,et al.  Multi-fold MIL Training for Weakly Supervised Object Localization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Bernt Schiele,et al.  RALF: A reinforced active learning formulation for object class recognition , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Luc Van Gool,et al.  Interactive object detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Kristen Grauman,et al.  Click Carving: Segmenting Objects in Video with Point Clicks , 2016, HCOMP.

[17]  Vladimir Kolmogorov,et al.  "GrabCut": interactive foreground extraction using iterated graph cuts , 2004, ACM Trans. Graph..

[18]  T. Tuytelaars,et al.  Weakly Supervised Object Detection with Posterior Regularization , 2014 .

[19]  Svetlana Lazebnik,et al.  Active Object Localization with Deep Reinforcement Learning , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20]  Andrew Blake,et al.  "GrabCut" , 2004, ACM Trans. Graph..

[21]  Yi Zhu,et al.  Soft Proposal Networks for Weakly Supervised Object Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Minoru Asada,et al.  Purposive Behavior Acquisition for a Real Robot by Vision-Based Reinforcement Learning , 2005, Machine Learning.

[23]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[24]  Abhinav Gupta,et al.  Beyond active noun tagging: Modeling contextual interactions for multi-class active learning , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[25]  Hao Su,et al.  Crowdsourcing Annotations for Visual Object Detection , 2012, HCOMP@AAAI.

[26]  Jia Xu,et al.  Learning to segment under various forms of weak supervision , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Miriam Bellver,et al.  Hierarchical Object Detection with Deep Reinforcement Learning , 2016, NIPS 2016.

[28]  Andrea Vedaldi,et al.  Weakly Supervised Deep Detection Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Adriana Kovashka,et al.  Actively selecting annotations among objects and attributes , 2011, 2011 International Conference on Computer Vision.

[30]  Ashutosh Saxena,et al.  High speed obstacle avoidance using monocular vision and reinforcement learning , 2005, ICML.

[31]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[32]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[33]  Nikolaos Papanikolopoulos,et al.  Multi-class active learning for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Marie-Pierre Jolly,et al.  Interactive Graph Cuts for Optimal Boundary and Region Segmentation of Objects in N-D Images , 2001, ICCV.

[35]  Cees Snoek,et al.  Spot On: Action Localization from Pointly-Supervised Proposals , 2016, ECCV.

[36]  Jian Sun,et al.  ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Frank Keller,et al.  Training Object Class Detectors from Eye Tracking Data , 2014, ECCV.

[38]  Kristen Grauman,et al.  Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds , 2011, CVPR 2011.

[39]  Devi Parikh,et al.  Attributes for Classifier Feedback , 2012, ECCV.

[40]  Kristen Grauman,et al.  Learning to look around , 2017, ArXiv.

[41]  Kristen Grauman,et al.  Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation , 2013, 2013 IEEE International Conference on Computer Vision.

[42]  Ivan Laptev,et al.  ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization , 2016, ECCV.

[43]  Thomas Deselaers,et al.  Localizing Objects While Learning Their Appearance , 2010, ECCV.

[44]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Arijit Biswas,et al.  Simultaneous Active Learning of Classifiers & Attributes via Relative Feedback , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Cristian Sminchisescu,et al.  Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition , 2012, ECCV.

[47]  Subhransu Maji,et al.  Similarity Comparisons for Interactive Fine-Grained Categorization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Ran El-Yaniv,et al.  Online Choice of Active Learning Algorithms , 2003, J. Mach. Learn. Res..

[49]  Fei-Fei Li,et al.  Best of both worlds: Human-machine collaboration for object annotation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Bo Han,et al.  TouchCut: Fast image and video segmentation using single-touch interaction , 2014, Comput. Vis. Image Underst..

[52]  Fei-Fei Li,et al.  What's the Point: Semantic Segmentation with Point Supervision , 2015, ECCV.

[53]  Noah Snavely,et al.  Material recognition in the wild with the Materials in Context Database , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Melih Kandemir,et al.  Variational Bayesian Multiple Instance Learning with Gaussian Processes , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Thomas Brox,et al.  Video Segmentation with Just a Few Strokes , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[56]  Pietro Perona,et al.  Visual Recognition with Humans in the Loop , 2010, ECCV.

[57]  Luc Van Gool,et al.  The Pascal Visual Object Classes (VOC) Challenge , 2010, International Journal of Computer Vision.

[58]  Pascal Fua,et al.  Learning Active Learning from Data , 2017, NIPS.